13
Long homopurine homopyrimidine sequences are characteristic of genes expressed in brain and the pseudoautosomal region Albino Bacolla, Jack R. Collins 1 , Bert Gold 2 , Nadia Chuzhanova 3,4 , Ming Yi 1 , Robert M. Stephens 1 , Stefan Stefanov 2 , Adam Olsh 2 , John P. Jakupciak 5 , Michael Dean 2 , Richard A. Lempicki 6 , David N. Cooper 4 and Robert D. Wells* Institute of Biosciences and Technology, Center for Genome Research, Texas A&M University System Health Science Center, Texas Medical Center, 2121 West Holcombe Blvd, Houston, TX 77030, USA, 1 Advanced Biomedical Computing Center and 2 Laboratory of Genomic Diversity, NCI-Frederick, Frederick, MD 21702, USA, 3 Biostatistics and Bioinformatics Unit, Cardiff University, Cardiff CF14 4XN, UK, 4 Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff CF14 4XN, UK, 5 National Institute of Standards and Technology, DNATechnologies Group, Biotechnology Division, Gaithersburg, MD 20899, USA and 6 Laboratory of Immunopathogenesis and Bioinformatics, SAIC-Frederick, Inc., Frederick, MD 21702, USA Received February 23, 2006; Revised March 13, 2006; Accepted April 20, 2006 ABSTRACT Homo(purinepyrimidine) sequences (RY tracts) with mirror repeat symmetries form stable triplexes that block replication and transcription and promote genetic rearrangements. A systematic search was conducted to map the location of the longest RY tracts in the human genome in order to assess their potential function(s). The 814 RY tracts with 250 uninterrupted base pairs were preferentially clustered in the pseudoautosomal region of the sex chromo- somes and located in the introns of 228 annotated genes whose protein products were associated with functions at the cell membrane. These genes were highly expressed in the brain and particularly in genes associated with susceptibility to mental disorders, such as schizophrenia. The set of 1957 genes harboring the 2886 RY tracts with 100 unin- terrupted base pairs was additionally enriched in proteins associated with phosphorylation, signal transduction, development and morphogenesis. Comparisons of the 250 bp RY tracts in the mouse and chimpanzee genomes indicated that these sequences have mutated faster than the surrounding regions and are longer in humans than in chimpanzees. These results support a role for long RY tracts in promoting recombination and genome diversity during evolution through destabilization of chromosomal DNA, thereby induc- ing repair and mutation. INTRODUCTION Chromosomal DNA exists principally as a right-handed double helix (B-DNA). However, other conformations, such as triplexes, tetraplexes, slipped structures with hairpin loops, left-handed Z-DNA and cruciforms are also known [reviewed in (1–7)]. These alternative (non-B DNA) confor- mations are formed at specific sequence motifs and are there- fore possible only at discrete chromosomal locations. More than 15 genomic disorders, including neurofibromatosis type I, chronic myeloid leukemia, spermatogenetic failure and recurrent constitutional translocations, have recently been associated with rearrangements mediated by recombination between blocks of repetitive DNA (from a few hundred base pairs to several hundred kilobase pairs in length) almost exclu- sively composed of direct repeats (DR) and inverted repeats (IR) [reviewed in (8)]. Since double-strand breaks (DSBs) are localized at hotspots within these blocks in most cases, fac- tors other than the primary DNA sequence must be involved, and indeed the locations of the rearrangement fusions are generally found at sequences known to adopt non-B DNA conformations (8–16). This conclusion is further supported by the finding that translocation frequencies correlate with genetic variations in the general population affecting the stability of the putative non-B DNA conformations at breakpoints (16). *To whom correspondence should be addressed. Tel: +1 713 677 7651; Fax: +1 713 677 7689; Email: [email protected] Ó The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] Nucleic Acids Research, 2006, Vol. 34, No. 9 2663–2675 doi:10.1093/nar/gkl354 Published online May 19, 2006 by guest on October 12, 2015 http://nar.oxfordjournals.org/ Downloaded from

Long homopurine*homopyrimidine sequences are characteristic of genes expressed in brain and the pseudoautosomal region

  • Upload
    cancer

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Long homopurinehomopyrimidine sequencesare characteristic of genes expressed in brainand the pseudoautosomal regionAlbino Bacolla Jack R Collins1 Bert Gold2 Nadia Chuzhanova34 Ming Yi1

Robert M Stephens1 Stefan Stefanov2 Adam Olsh2 John P Jakupciak5

Michael Dean2 Richard A Lempicki6 David N Cooper4 and Robert D Wells

Institute of Biosciences and Technology Center for Genome Research Texas AampM University System Health ScienceCenter Texas Medical Center 2121 West Holcombe Blvd Houston TX 77030 USA 1Advanced BiomedicalComputing Center and 2Laboratory of Genomic Diversity NCI-Frederick Frederick MD 21702 USA 3Biostatisticsand Bioinformatics Unit Cardiff University Cardiff CF14 4XN UK 4Institute of Medical Genetics Cardiff UniversityHeath Park Cardiff CF14 4XN UK 5National Institute of Standards and Technology DNA Technologies GroupBiotechnology Division Gaithersburg MD 20899 USA and 6Laboratory of Immunopathogenesis and BioinformaticsSAIC-Frederick Inc Frederick MD 21702 USA

Received February 23 2006 Revised March 13 2006 Accepted April 20 2006

ABSTRACT

Homo(purinepyrimidine) sequences (RY tracts)with mirror repeat symmetries form stable triplexesthat block replication and transcription and promotegenetic rearrangements A systematic search wasconducted to map the location of the longest RYtracts in the human genome in order to assess theirpotential function(s) The 814 RY tracts with 250uninterrupted base pairs were preferentially clusteredin the pseudoautosomal region of the sex chromo-somes and located in the introns of 228 annotatedgenes whose protein products were associated withfunctions at the cell membrane These genes werehighly expressed in the brain and particularlyin genes associated with susceptibility to mentaldisorders such as schizophrenia The set of 1957genes harboring the 2886 RY tracts with 100 unin-terrupted base pairs was additionally enriched inproteins associated with phosphorylation signaltransduction development and morphogenesisComparisons of the 250 bp RY tracts in themouse and chimpanzee genomes indicated thatthese sequences have mutated faster than thesurrounding regions and are longer in humans thanin chimpanzees These results support a rolefor long RY tracts in promoting recombinationand genome diversity during evolution through

destabilization of chromosomal DNA thereby induc-ing repair and mutation

INTRODUCTION

Chromosomal DNA exists principally as a right-handeddouble helix (B-DNA) However other conformations suchas triplexes tetraplexes slipped structures with hairpinloops left-handed Z-DNA and cruciforms are also known[reviewed in (1ndash7)] These alternative (non-B DNA) confor-mations are formed at specific sequence motifs and are there-fore possible only at discrete chromosomal locations Morethan 15 genomic disorders including neurofibromatosistype I chronic myeloid leukemia spermatogenetic failure andrecurrent constitutional translocations have recently beenassociated with rearrangements mediated by recombinationbetween blocks of repetitive DNA (from a few hundred basepairs to several hundred kilobase pairs in length) almost exclu-sively composed of direct repeats (DR) and inverted repeats(IR) [reviewed in (8)] Since double-strand breaks (DSBs)are localized at hotspots within these blocks in most cases fac-tors other than the primary DNA sequence must be involvedand indeed the locations of the rearrangement fusions aregenerally found at sequences known to adopt non-B DNAconformations (8ndash16) This conclusion is further supportedby the finding that translocation frequencies correlate withgenetic variations in the general population affectingthe stability of the putative non-B DNA conformationsat breakpoints (16)

To whom correspondence should be addressed Tel +1 713 677 7651 Fax +1 713 677 7689 Email rwellsibttamhscedu

The Author 2006 Published by Oxford University Press All rights reserved

The online version of this article has been published under an open access model Users are entitled to use reproduce disseminate or display the open accessversion of this article for non-commercial purposes provided that the original authorship is properly and fully attributed the Journal and Oxford University Pressare attributed as the original place of publication with the correct citation details given if an article is subsequently reproduced or disseminated not in its entirety butonly in part or as a derivative work this must be clearly indicated For commercial re-use please contact journalspermissionsoxfordjournalsorg

Nucleic Acids Research 2006 Vol 34 No 9 2663ndash2675doi101093nargkl354

Published online May 19 2006 by guest on O

ctober 12 2015httpnaroxfordjournalsorg

Dow

nloaded from

Triplex DNA requires homo(purinepyrimidine) sequences(RY tracts) and is stabilized by Hoogsteen hydrogen bondsbetween the purines in the WatsonndashCrick duplex and a thirdstrand in the major groove which may be composed of eitherpyrimidines bound in the parallel orientation (YRY triplexes)or purines (RRY triplexes) bound in the antiparallel orienta-tion (13617ndash19) Specific interactions consist of T-AT andC+-GC triads for YRY triplexes and G-GC and A-ATtriads for RRY triplexes hence mirror repeat symmetrieswithin RY tracts yield fully paired and stable triplexesThe YRY triplexes are additionally stabilized by cytosineprotonation at N3 a process that requires low pH innucleotides but which occurs cooperatively and at neutralpH in clustered cytosines in a long polynucleotide chain(20ndash22) Finally the third strand may be provided by thefolding back of a single RY tract with mirror repeat symme-try (intramolecular triplex) by the interaction between twoRY tracts separated by some distance on the same or ontwo different DNA molecules (intermolecular triplex) or bya single-stranded oligonucleotide composed of either DNAor RNA (61923)

RY tracts are genetically unstable in experimental modelsystems Whereas numerous biological functions have beenattributed to triplexes including the blockage of DNA rep-lication (24ndash26) and the interference with transcription(2627) recent studies have provided revealing insightsFirst studies conducted in live mice mammalian cell culturesand Escherichia coli indicate that RY tracts are highly muta-genic (28ndash30) induce DSBs (3031) and are frequently foundat the breakpoint sites of gross deletions and other rearrange-ments (931) Indeed three types of biochemical and geneticstudies have shown that these genomic instabilities were dueto the non-B conformations adopted by the RY tracts andnot to the tracts in their orthodox right-handed B-form (32)Second long GAATTC repeats in the first intron of the fra-taxin (FXN) gene adopt several structures including triplexand sticky DNA which inhibit transcription of the genethus reducing the expression of frataxin (263334) Hencetriplexes may be involved in the etiology of Friedreichrsquosataxia Third triplexes may play critical roles in the bcl-2 major breakpoint region with respect to the RAG-dependentt(1418) translocation associated with follicular lymphomas(35ndash37) Fourth RY tracts elicit a biological response inthe context of mammalian chromatin also in the absence ofperfect mirror repeat symmetry and at moderate-to-shortlengths (20 bp or less) (35ndash38) indicating that the energeticbarrier associated with the duplex-to-triplex structural transi-tions may be easily overcome

Long (several hundred base pairs) RY tracts in the humangenome have been known for gt20 years (6173940) andprevious limited studies suggested their abundance in mam-malian genomes (41ndash43) To date no genome-wide querieswhich might be informative with respect to the potentialfunction(s) of these sequences have been reported Knowl-edge about the size and location of non-B DNA conforma-tions in vertebrate genomes would be expected to givecritical clues as to their biological functions The human gen-ome has so far only been surveyed for IR sequences (44)which can form cruciforms Warburton et al (44) reportedthat the large IR may have a role in testis gene expressionand genome integrity

Herein we have applied a data-mining approach to con-duct the first systematic search for the longest uninterruptedRY tracts in the human genome and tested specific hypothe-ses by employing experimental methodologies in silico Weshow that these sequences cluster specifically in the pseu-doautosomal region of the sex chromosomes and are foundpredominantly in genes that are highly expressed in thebrain These determinations along with comparative analysesof the mouse and chimpanzee genomes indicate that longRY tracts constitute mutational hotspots and are likely tohave played a key role in genome plasticity and evolution

MATERIALS AND METHODS

Computer searches

All RY tracts were found using the program PTRfinder (45)The algorithm first maps the DNA sequence to R or Y forpurine and pyrimidine respectively Then the programlocates tandem repeats with this reduced alphabet andgiven the minimum lengths input by the user The resultswere then imported into the Genomic Resource InformationDatabase (GRID httpgridabccncifcrfgov) (J R CollinsR M Stephens and J Shan manuscript in preparation) forweb access and for generating queries to correlate RY tractswith gene location and overlap Genomic build parametersdescribing tracts of pure R or Y were extracted from theGRID database using the dataset from the Human GenomeBrowser at the University of California Santa Cruz(UCSC) which was based on the hg17 NCBI Build 35May 2004 assembly The following SQL query was used toretrieve all tracts of type R or Y with length gt250 bplsquoSELECT from lsquopupyrsquo WHERE lsquolenrsquogtfrac14250 ANDlength(lsquotypersquo)frac141 ORDER BY lsquochromrsquo lsquolenrsquorsquo This queryretrieved 818 records 814 with fully determined coordinateswhich were imported into MSExcel spreadsheets An addi-tional table column with 814 links to the UCSC Human Gen-ome Browser was created using a URL pattern to link eachrecord to the exact genomic position By opening the linkseach record was visually checked for being within or betweengene sequences Thus 241 of the 814 tracts were found tooccur within annotated genes Of the 228 non-redundantRY tract-containing genes information was gathered for156 genes either through OMIM or PubMed (httpwwwncbinlmnihgov)

In order to compare the 814 tracts and their surroundingsequences in human with their orthologous counterparts inchimpanzee (Pan troglodytes) five PERL (Practical Extrac-tion and Reporting Language) scripts were written A stan-dalone BLAT (Blast-like Alignment Tool) program wasdownloaded from httpwwwsoeucscedu~kentsrc andinstalled on a computer with Linux FC3 The human andchimpanzee (NCBI Build 1 version 1 November 2003) gen-omes were downloaded from the UCSC website In additionthe free package CLUSTAL W 183 and the EMBOSSapplication DOTMATCHER (Ian Longden Sanger InstituteCambridge UK) which is under GNU license (httpwwwgnuorg) were downloaded and installed

The first PERL script lsquo1_queryTOfastaplrsquo processed the814-record table containing the repeat data Coordinateswere transformed so as to include 2500 bp of flanking

2664 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

sequences from both sides of each tract For each recordthe respective DNA sequence was extracted from thelocal human genome files and added to a file in FASTAformat containing all records A second PERL scriptlsquo2_saBLATfastaVSchromosomesplrsquo automated the BLASTsearch of every record against the local chimpanzee genomeTo speed up this process a previously made chimpanzeelsquo11oocrsquo R or Y tracts file was used (for details see BLATFAQs and tutorials at the website previously cited) The out-put was a directory with BLAT-hits results in chimpanzee inPLSX table format one file for each record A third PERLscript lsquoselectFROMfileplrsquo processed the PLSX-files directorypicking up the best score from each file and merging all thebest-scores into a single SELECT file which also contained814 records A fourth PERL script lsquo4_plsxTOfastaCHRplrsquoprocessed the 814-record SELECT table in the same waythat the first script processed the GRID-database table Itextracted from the local chimpanzee genome each corre-sponding segment of DNA sequence and added them to aFASTA file The product of running the first through fourthscripts are two multiple entry DNA FASTA files one withhuman queries and the other with chimpanzee matchesA fifth script lsquo5_html_uplrsquo compared all query and matchsequences from the two files above It also ran CLUSTAL Wand DOTMATCHER for each of the pairs of records Sub-sequently it created an HTML document presenting theresults For ease in troubleshooting the job was performedin several steps using separated scripts even though itcould readily be integrated into one The results for theRY tracts present within genes have been posted at thefollowing website httphomencifcrfgovccrlgddean_labpure_R_or_Y_Comparison

Similarly 5 kb human DNA fragments containing the RYtracts centrally were used as queries in BLAT (httpgenomeucsceducgi-binhgBlat) searches on the mouse (Mus muscu-lus Mm5 NCBI build 33 May 2004) macaque (Macacamulatta NCBI Mmul_01 January 2005) and dog (Canisfamiliaris NCBI v10 July 2004) genomes Genomic inter-spersed repeats were identified by RepeatMasker (httpwwwrepeatmaskerorg)

Tissue expression of RY tract-containing genes

Tissue gene expression data were downloaded from theGenomic Institute of the Novartis Research Foundation(GNF httpwebgnforgindexshtml) and comprised theAffymetrix U133A chip (22 000 probe sets) plus a customGNF Affymetrix chip with 11 000 probe sets analyzed on79 human tissues and cell lines Approximately 17 000known human genes were mapped to these probe sets andtheir relative expression levels were analyzed in the 79 tissuetypes to identify genes preferentially expressed in each tissuePreferential expressed genes are defined as follows theexpression levels of a given gene across all tissues were trans-formed to a Z-score and genes with a Z-score gt 1 (ie gt1 SDabove the mean) in a given tissue were classified as highlyexpressed in that tissue The Z-score was averaged whenmore than one probe set hybridized to the same gene andfor duplicate tissues A total of 1 390 000 Z-score valuesfor all genes and all tissues were obtained 10 of whichqualified as highly expressed P-values were obtained by

comparing the fraction of the 17 000 genes highlyexpressed in a given tissue with the fraction observed forthe set of genes containing either the pure RY tracts gt250bp or the pure RY tracts gt100 bp in length assuming abinomial distribution P-values were corrected (Bonferroni)for multiple comparison of 79 tissues

Functional category enrichment analysis and creation ofgene-term association networks (GTANs)

Functional category enrichment analyses were performedusing a software tool WholePathwayScope (WPS) (46)developed at the ABCC (NCI-Frederick MD) Briefly thisanalysis is based on Fisher Exact Test for 2 middot 2 contingencytables (gene list versus functional category) to estimate andrank the statistical significance of the enrichment of func-tional categories within a given system (GO terms BioCartapathways KEGG pathways gene-disease associationsprotein interaction partners and protein families) for genesharboring gt250 pure RY tracts (RY250) or gt100 pureRY tracts (RY100) The databases used which were col-lected in the WPS database were as follows the GO termsand gene-GO term association tables were downloaded fromthe Gene Ontology Consortium (httpwwwgeneontologyorg) the BioCarta pathways were kindly provided by theCGAP group (Cancer Genome Anatomy Project httpcgapncinihgov) which originated from the Bio Carta path-way collections (httpwwwbiocartacomgenesallPathwaysasp) the KEGG pathways were downloaded from the KEGGpathway collections (Kyoto Encyclopedia of Genes and Gen-omes httpwwwgenomeadjpkegg) the gene disease asso-ciations were downloaded from the Genetic AssociationDatabase (httpgeneticassociationdbnihgov) informationon Protein Interaction Partners and Protein families (Pfam)information was kindly provided by the DAVID (Databasefor Annotation Visualization and Integrated Discovery)group (httpdavidniaidnihgov) Gene-Term AssociationNetworks (GTANs) were created within WPS such that forany given gene in a gene list (RY100 or RY250) its associ-ated term(s) was sought in the WPS database The pairwisegene-term relationships were represented as a graphical lay-out of a Gene-Term Association Network within WPS inwhich any gene and its associated term were linked with anedge to indicate the gene-term association relationship

Gene size

The sizes for annotated genes were obtained from the lengthsof pre-mRNA transcripts which were downloaded fromthe National Center for Biotechnology Information (NCBI)website (ftpftpncbinihgovgenomesH_sapiensRNA)The set of 1377 genes used to determine the size distributionof genes expressed in brain was obtained from the tissue distri-bution analysis with genes with a Z-score gt 1 in brain tissuesbeing selected In most cases the SigmaPlot 2002 programversion 802 (SPSS Chicago IL) was used to represent theresults graphically and to determine the best fit to the data

Clustering of RY tracts

The order statistics r-scans as described in Karlin andMacken (47) were used to detect significant clustering

Nucleic Acids Research 2006 Vol 34 No 9 2665

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

of the RY tracts observed along each chromosome bycomparing their distribution with that of a uniform Poissondistribution We assume that Xi are the distances betweenadjacent RY tracts (n tracts in total) along an individualchromosome that are not necessarily independently and ident-ically distributed (iid) Note that all points are mapped into a[01] interval Denote by

YethrTHORNi frac14

Xithornr1

jfrac141

Xjsbquo i frac14 1sbquo sbquon r thorn 1sbquor n

the distance from the j-th RY tract to its r-th nearest neigh-bor There is an order statistic associated with the sequence ofpartial sums

YethrTHORN1 sbquoY

ethrTHORN2 sbquo sbquoY

ethrTHORNnrthorn1

and a minimum defined as

methrTHORN frac14 Y1 frac14 min fY

ethrTHORN1 sbquoY

ethrTHORN2 sbquo sbquoY

ethrTHORNnrthorn1g

If we assume that distances Xi are iid then according toKarlin and Macken (47)

Pr methrTHORN ltx

n1thorneth1rTHORN

1 exp eth lTHORNsbquo l frac14 xr

rsbquor gt 1

We declare significant clustering when the observed mini-mum for the limit distribution has lt001 (or lt005) probabil-ity of occurring by chance and therefore

methrTHORN lt1

n

reth ln 099THORNn

1r

Several RY tract clusters were detected (after correctingfor adjacent tracts separated by single nucleotide interrup-tions) by testing the first minimum of r-scans (where r frac141 16)

Complexity analysis

Complexity analysis as devised by Gusev et al (48) wasemployed in a search for DR IR and mirror repeats (MR)in the pseudoautosomal region (PAR1) and the downstream44 Mb region (After_PAR) in the 5 kb sequences flankingthe RY tracts in the PRKCB1 and ADAM18 genes in the30-untranslated region (30-UTR) of the DISC1 gene and inthe comparative analyses of the RY tracts in the mammalianTIAM1 gene

RESULTS

RY tracts within genes

The human genome was screened for the largest RY tractswith the expectation that these could shed light on theirpotential biological role(s) The total number of tracts equal-ing or exceeding an arbitrary length cutoff of 250 pure R(adenine A and guanine G) or Y (cytosine C and thymineT) bases was 814 Approximately 30 (241814) werelocated within annotated genes (228 some genes containedmore than one tract Supplementary Table 1) all withinintrons The longest tract (1303 bp) was present in the

CENTA1 gene on chromosome 7 which was the only pureintragenic sequence to exceed 1 kb Tracts were then ana-lyzed to determine the RY tract length if occasional inter-ruptions (mostly single nt changes) were ignored The25 kb PKD1 RY tract (49) which has been instrumentalin the characterization of the mutational properties of RYsequences (92532) was the third longest preceded onlyby a 31 kb tract in the DKFZp434G0625 gene and a tractof nearly 4 kb in the LOC348094 gene (both encoding prod-ucts of unknown function) Hence the PKD1 gene possessesthe longest gene-associated RY tract in the human genomefor genes of known function A double logarithmic plot ofthese tracts arranged in order of size exhibited a linear rela-tionship (y frac14 3499 ndash 0458x r2 frac14 099 data not shown)indicating an exponential decay distribution This distributionsuggests that selection against or in favor of any RY tractlength is not exercised to a detectable level within the exist-ing envelope

Complexity analysis was employed to identify MR sepa-rated by any distance within these RY tracts The averagefractions of non-redundant MR elements (with no interrup-tions) per RY tract were 16 4 and 2 for MR lengths of10 20 and 30 bp respectively In addition all RY tractshad at least one 10 bp-long MR element Hence based onthe work conducted on the PKD1 and other RY tractsin vivo (928303235375051) we conclude that most ifnot all RY tracts with gt250 uninterrupted base pairs havethe potential to form intramolecular andor intermolecular tri-plexes in vivo However future work will be required toexperimentally evaluate triplex formation by these RYtracts

To determine whether there was preferential associationwithin certain categories of genes we compared the set ofRY tract-containing genes with that of the human (refer-ence) dataset in proteomic databases (Supplementary Table 2)For the Gene Ontology (GO) Molecular Function eightterms exceeded a P-value of 104 and included seven termsfor channel activities and one term for glutamate receptoractivity consistent with strong enrichment for genes encod-ing transmembrane proteins in the brain (SupplementaryTable 2A) The GO Biological Process analysis revealed13 terms which exceeded P-values of 104 and includedthree terms for cell adhesion and cell communication fourfor neuronal function and three for ion transport (Supplemen-tary Table 2B) indicative of a preferential distribution withingenes involved in specialized functions at the cell membraneThe GO Cellular Component analysis (SupplementaryTable 2C) yielded four terms that exceeded P-values of104 and which were associated with localization to the cellmembrane

The fibronectin type III C-cadherin and a-catenindomains present in many cell surface receptors and celladhesion molecules represented the most highly enrichedcategories (P-values 104) in the Protein Families (Pfam)and Protein Information Resource (PIR) databases (Supple-mentary Tables 2D and 2E) whereas eight terms includingthe large neuroactive ligand interaction pathway which com-prises an array of neuronal receptors implicated in neuropep-tide and small molecule signaling pathways were found to beenriched in the more limited BioCarta and Kyoto Encyclope-dia of Genes and Genomes databases (Supplementary

2666 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

Tables 2F and 2G) Analysis of the Protein InteractionPartners database (Supplementary Table 2H) indicated3491 enrichments led by DLG4 (psd-95 P 8 middot 104) acomponent of the post-synaptic density structure involvedin receptor clustering Finally analysis of the DiseaseAssociation database indicated that the most significantenrichment (P 5 middot 103) was for candidate genes forschizophrenia susceptibility (Supplementary Table 2I) Insummary the longest RY tracts were non-randomly dis-tributed and were disproportionately associated withgenes whose products are localized to the plasma membraneand which perform cell communication and transportfunctions

Distinct gene categories

The gt250 bp RY tract-containing genes represented 1of the annotated human gene dataset We therefore soughtto determine whether lowering the RY length thresholdwould yield a larger set of genes with the same non-randomdistribution profile A search for genes with known functionscontaining pure RY tracts gt100 bp in length yielded a totalof 2886 hits and a non-redundant gene set of 1957 Com-parison of the distribution of these RY tract-containinggenes with that of the reference dataset (SupplementaryTable 3) revealed that the terms most highly enriched werecommon to those identified for the genes containing thegt250 RY tracts (Table 1) This correlation was particularlystriking for gene products involved in signal transduction

pathways at synapses which also showed strong associa-tions with susceptibility to schizophrenia (SupplementaryFigures 1 and 2 Supplementary Table 4 and SupplementaryText)

Since P-values are sensitive to sample size we next deter-mined whether the terms were more enriched in gt250 orgt100 bp RY tract-containing genes based on lsquofold-enrichmentrsquo Consistently greater enrichments were notedfor the gt250 bp RY tract-containing genes (Table 1)Hence as the RY tract length increases so does the proba-bility of their association with genes involved in specificfunctions at the plasma membrane

A second set of terms was found to be highly enriched inthe gt100 bp RY tract-containing genes but not in the gt250bp RY tract-containing genes These included terms contain-ing transferase and kinase activities (GO_MF) and proteinreceptor phosphorylation of genes implicated in developmentand morphogenesis (GO_BP) implying an enrichment ingenes involved in signal transduction pathways (Supplemen-tary Figure 3 and Supplementary Text)

Hence we conclude that the longest RY tracts in thehuman genome are distributed between two main pools ofgenes in a length-dependent fashion The first group com-prising shorter length tracts (100 bp lt RY lt 250 bp) co-localized preferentially with genes involved in signal trans-duction pathways associated with development whereas asecond group comprising longer tracts (RY gt 250 bp)co-localized with genes mostly involved in ion transportcell adhesion and neurogenesis Since several terms exceeded

Table 1 Major enriched categories for genes with gt 250 bp and gt 100 bp RY

Combined Rank Category (termmdashpathway) P-value Fold enrichmentgt250 RY gt100 RY gt250 RY gt100 RY

GO molecuar function10 Ion channel activity 195E05 592E09 44 2212 Protein binding 314E03 625E15 17 1620 Glutamate receptor activity 611E04 192E07 103 43

GO biological process5 Cell adhesion 111E04 336E12 34 227 Cell communication 219E04 524E15 16 14

15 Transmission of nerve impulse 183E04 524E08 51 25GO cellular component

2 Membrane 278E06 246E14 16 1315 Synapse 218E02 769E05 50 2819 Extracellular matrix 496E02 133E08 23 22

Pfam family2 Fibronectin type III 284E05 143E12 51 28

15 C-cadherin 776E05 314E05 170 4317 C2 117E04 395E05 54 21

PIR family3 a-catenin 355E04 120E03 604 94

11 Cadherin 181E02 404E04 95 40KEGG pathway

5 Neuroactive ligandndashreceptor interaction 706E03 252E03 29 156 Phosphatidylinositol signaling system 230E02 187E05 48 267 Choleramdashinfection 126E02 339E03 60 22

Protein interaction partners22 DLG4 798E04 206E03 451 6123 RAC1 834E03 978E04 71 24

Disease association4 Schizophrenia 523E03 233E03 75 26

Combined Rank sum of the two categories ranks from Supplementary Tables 2 and 3 categories with one entry were excluded when more than one categorywith largely redundant gene entries was present only one was chosen

Nucleic Acids Research 2006 Vol 34 No 9 2667

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

P-values of 1010 and the different databases (particularly thelarge GO_MF GO_BP GO_CC and Pfam Family) were inagreement in identifying terms that contained the samegenes the conclusions are strongly supported by the data

Tissue gene expression

To determine whether the gt250 bp RY tract-containinggenes (250-set) and the gt100 bp RY tract-containinggenes (100-set) expressed a greater proportion of transcriptsin specific tissues as compared with the reference dataset(All-set) we examined the Affymetrix U133A tissue expres-sion dataset derived from 79 tissues from the Genomic Insti-tute of the Novartis Research Foundation (GNF) For a giventissue the fraction of highly expressed genes in the 250-set(and 100-set) was then compared with the fraction of highlyexpressed genes in the reference dataset (All-set)

The fractions of highly expressed genes containing RYtracts (250-set and 100-set) were significantly greater thanthe corresponding fractions of the All-set in all brain tissuesexamined in the atrioventricular node of the fetus and theuterus (Figure 1 P-values ranged from 4 middot 102 to lt1 middot1013 for the 100-set and from 2 middot 102 to 6 middot 105 forthe 250-set) These fractions were also consistently moreenriched in genes with longer RY tracts (250-set) thanwith shorter ones (100-set) regardless of the P-values Insummary these data are compatible either with the viewthat RY tracts with lengths gt100 bp (including gt250) co-localize preferentially with genes highly expressed in thebrain or that these tracts may perform a role in mediatinggene transcription in brain tissues

Gene size distributions

Some of the genes expressed in human brain tissues such ascontactin-associated protein-like 2 (CNTNAP2) dystrophin(DMD) and ataxin-2 binding protein (A2BP1) are amongthe largest known each spanning gt15 Mb of genomicDNA We therefore posed the question as to whether the pref-erential finding of RY tracts in brain-expressed genes mightbe associated with larger gene size The size distribution forthe annotated human gene population followed a Gaussiandistribution when the logarithms of gene lengths were ana-lyzed (Figure 2 gray) According to this distribution theaverage gene length is 18 kb (Figure 2mdashmode and meancoincide for Gaussian distributions) In contrast the set of1377 genes highly expressed in the brain yielded a Gaussiandistribution peaking at 43 kb (black) Thus human genespredominantly expressed in brain tissues are on averagemore than twice as long as those from the total gene popula-tion when their log-normal distributions are compared Thisnotwithstanding the size distributions of the RY tract-containing genes were of disproportionately greater sizewith the gt100 bp RY tract-containing gene set peaking at108 kb (green) and the gt250 bp RY tract-containing geneset peaking at 192 kb (red) In addition the gt250 bp RYtract-containing genes displayed a pronounced negativeskewness and hence a non-Gaussian behavior We thereforeconclude that larger human genes also tend to host longerRY tracts

Next we investigated whether larger genes also containeda greater number of RY tracts By analyzing the gt100 bpRY tract-containing gene set for all pure RY tracts gt50

Figure 1 Tissue distribution of RY tract-containing genes The percentage increase in the fraction of RY-containing genes (100-set and 250-set) highly expressed(Z-score gt1) in a given tissue relative to the fraction of all genes highly expressed in the same tissue is displayed on the x-axis Asterisks significantly enriched tissuesas determined by binomial probability after a Bonferroni multiple comparison correction Significance P lt 107 107 lt P lt 005 - not significant

2668 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

bp in length 33 genes were found to harbor between 20 and57 RY tracts (Supplementary Table 5) The size of thesegenes ranged from 38 kb to 23 Mb with a median of800 kb and an average of 910 kb indicating that largergenes also tend to host more RY tracts as predicted

The gene size distributions of Figure 2 also raise the ques-tion as to whether the percent increases in the proportions ofhighly expressed RY tract-containing genes in the brain(Figure 1) could be accounted for by the overall greaterlength of the genes expressed in this organ Consideringthat the distributions PAll PBrain PRY gt 100 and PRY gt 250

(Figure 2 gray black green and red lines respectively) areinterdependent and log-normal (with the exception of PRY gt

250) the predicted ratio increase may be calculated from thefollowing relationshipRethPRY middot PBrainTHORNdxRethPRY middot PAllTHORNdx

sbquo

where PRY is either PRYgt100 or PRYgt250 using the means andstandard deviations that defined the respective curvesThe values obtained for the gt100 and gt250 bp RYtract-containing gene sets were 130 and 132 respectivelyfor the integrals bound by the log gene sizes of 0 and 8(which extends beyond all gene sizes) yielding a predictedpercent increase of 30 For the 250-set all percentageincreases were greater than this value of 30 for the 13brain tissues examined (Figure 1) About 23 of the 100-setpercentage increases in brain tissues were gt30 (up to75) whereas 13 were close to 30 In summary thesedata indicate that the number of RY tract-containing genes(but not the number of RY tracts per kilobase pair)expressed in the brain is greater than expected based ontheir longer lengths

The pseudoautosomal region

The total number of pure RY tracts gt250 bp in the humangenome was 814 We evaluated whether these tracts were

evenly dispersed following a Poisson-like distributionthroughout chromosomes or whether instead they were clus-tered in specific regions By testing each chromosome sepa-rately four small clusters were noted on chromosomes 5 6and 12 each involving 2ndash5 tracts (Supplementary Figure 4)In contrast two large clusters were identified on the X and Ychromosomes of 17 and 11 tracts respectively starting fromthe telomeric p-arm and extending for 62 and 26 Mbrespectively The sequence of the first 26 Mb (pseudoautoso-mal region or PAR1) is identical on both sex chromosomesand is essential for homologous recombination during malemeiosis and chromosome pairing (5253) This promptedspeculation that the RY tracts might play a role in PAR1function by virtue of their structural properties perhaps inconcert with other non-B DNA-forming sequences such asIR DR and MR

A search for pairs of DR IR and MR with an arbitrarylength cut-off of gt62 bp in the first 70 Mb of the X chromo-some revealed a significant (c2 test) over-representation ofDR of both gt62 bp and gt250 bp and IR of gt62 bp in thePAR1 region by comparison with the 44 Mb region that fol-lows (After_PAR P lt 00001) In contrast MR of gt62 bpand IR of gt250 bp were not significantly over-represented(Figure 3 and Supplementary Text) In addition the distribu-tion of MR was characterized by a preponderance of(GAAATTTC)n motifs in PAR1 (1011 in PAR1 and 815in After-PAR) We surveyed the sequence composition ofall DR and IR of gt180 bp to determine whether they wereuniquely represented in the PAR1 region or whether they cor-responded to interspersed elements distributed genome-wideNo significant homology was found to any of the 201 repeatsin PAR1 In contrast 1121 repeats in the After-PAR regionexhibited extensive homologies with other chromosomalsites indeed 911 tracts were identified as LINE1 (79)LINE1LTR (19) or LTR (19) elements Additional largesegments of repetitive DNA were also identified in PAR1Seven (gt1 kb) were closely examined and found to compriseorderly arrays of tandem repeat blocks (TRB) containingmultiple sequence motifs with few interruptions (Supplemen-tary Figure 6) In summary these data indicated that the clus-tered RY tracts gt250 bp form part of a larger family ofrepetitive elements mostly DR of unique composition andshort IR that densely populate the PAR1 region

The (GAAATTTC)n repeats

The recurrence of the (GAAATTTC)n repeat within PAR1was intriguing since this sequence possesses the RY mirrorsymmetry that facilitates triplex-formation (1361718) Inaddition its similarity to the (GAATTC)n motifs whichform extraordinarily stable non-B structures (26345455)raised the question as to whether these structural propertiesmight have been functionally exploited throughout the gen-ome We addressed this issue by searching for RY tractswith mirror symmetry gt30 bp in length (sufficient to yieldstable triplexes) and comparing their frequencies with thoseof all microsatellites of comparable repeat units and totallengths The search for microsatellites with unit lengthsfrom 1 to 29 nt yielded a bimodal distribution with a firstpeak comprising the mono- to penta-nucleotide repeats(Supplementary Figure 5) and a second peak composed of

Figure 2 Relationship between gene size and the presence of RY tracts Thefractions of genes with log gene size falling within 02 (025 for gt100 bp RYtract-containing genes and 03 for gt250 bp RY tract-containing genes) log-intervals plotted as a function of their mean values Gray 22799 human genesblack 1377 brain-specific human genes green 1891 gt100 bp RY tract-containing human genes red 200gt250 bp RY tract-containing human genes

Nucleic Acids Research 2006 Vol 34 No 9 2669

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

gt15 nt unit lengths most of which could be decomposed intoshorter repeat units displaying an evenodd length asymme-try as previously noted for the short (1ndash6mer) repeats (56)Dinucleotides were the most abundant (55 196 copies)followed by tetra- (29 590 copies) mono- (16 686 copies)penta- (8699) and tri-nucleotides (6711) The distribution ofRY tracts revealed an abundance of (AT)n (16 679 copies)over (GC)n (seven copies Figure 4) At present we do notunderstand this paucity of (GC)n runs The distributionswere followed by (GAAATTTC)n (3217 copies) (GATC)n

(2390 copies) and (GGAATTCC)n (2200 copies) All othercombinations of G+A repeats contained lt400 copies There-fore given that poly(A) sequences have the unique propertyof decreasing triplex stability with increasing length (57ndash59)these analyses (Figure 4 and Supplementary Figure 5) indi-cated that the (GAAATTTC)n repeats were over-representedwith respect to the total microsatellite population and wereespecially abundant among the triplex-forming motifs withmirror symmetry However the role if any of such tractsand the underlying ability to form intra- or intermoleculartriplexes remains to be determined

Evolutionary comparisons

To determine whether long pure RY tracts (gt250 bp) areunique to human we queried the chimpanzee dog mouserat and chicken genomes Such RY tracts lengths werefound to be common for all the mammalianavian speciesexamined (Supplementary Table 6) In addition the numberof pure sequences gt50 bp varied from 8000 in chicken togt100 000 in murids attesting to their abundance A search ofthe mouse genome for genes with gt250 bp RY tractsreturned a set of 827 non-redundant entries and indicatedthat the main functional enrichment categories (Supplemen-tary Table 7) largely overlapped those of the gt100 bp RYtract-containing human gene set (Supplementary Table 3)This indicates that RY tracts have been retained within thesame gene families since humanndashmouse divergence 80 mil-lion years ago (Mya) and also that human membrane-associated genes (Supplementary Table 2) highly expressedin brain have maintained longer RY tract lengths thanother gene families

This notwithstanding little or no homology was foundbetween human and mouse orthologous genes when 5 kbfragments flanking the gt250 bp RY tracts of the 228human genes (Supplementary Table 1) were compared Simi-larly analysis of the human and mouse orthologous geneswhich contained gt250 bp RY tracts in both species (24total) revealed little conservation in terms of either thenumber or location of the tracts (Supplementary Table 8)Thus whereas RY tracts have been retained within genefamilies their lengths and locations within individual geneshave diverged substantially

Humanndashchimpanzee orthologous RY tracts

To determine more accurately the extent of sequence conser-vation the human intragenic RY tracts gt250 bp togetherwith plusmn25 kb of flanking sequence were compared withtheir orthologous sites in the chimpanzee genome Thesespecies diverged 5ndash8 Mya from their last common ancestor(LCA) and still share 98 sequence identity Orthologoussequences were identified for most of the 239 tracts(Supplementary Text) A typical dot-plot depicting a com-parison of 5 kb of the human CD99L2 gene with its chim-panzee counterpart is displayed in Figure 5 Sequence

Figure 4 Number of Triplex-forming sequences with mirror symmetry in thehuman genome Gray bars total numbers of uninterrupted RY tracts withmirror symmetry gt30 bp in length

Figure 3 Repetitive elements in the pseudoautosomal region The vertical lines represent the locations of repetitive elements in the first 7 Mbp of the human Xchromosome containing the PAR1 region RY clustered RY repeats from Supplementary Figure 4 MR mirror repeatsgt62 bp DRgt250 direct repeatsgt250 bpIRgt250 inverted repeatsgt250 bp DRgt62 direct repeatsgt62 bp butlt250 bp IRgt62 inverted repeatsgt62 bp butlt250 bp TRBgt1kb sequences containing tandemrepeat blocks gt948 pure with a total length gt1kb red annotated genes XG (in blue) gene spanning the PAR1 boundary

2670 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

identity is evident throughout the region with the exception ofthe R tract which although present in both species is some-what shorter in chimpanzee Of all the 142 RY sequencepairs analyzed most displayed the expected 98 homology

in the RY tract-flanking sequences but none showedsequence conservation within the tracts We conclude thatthe RY tracts have mutated at a much faster rate than thesurrounding sequences Subsequent analyses (SupplementaryFigure 7) provided evidence for a combination of slippedmispairing recombination-mediated duplications nucleotidesubstitutions and possibly also gene conversion in mediatingRY tract divergence

To determine whether RY tracts manifest a length bias inhominids we compared the total length of all pure tractsgt250 bp in both species with their orthologous counterparts(Supplementary Text) 80 of the 582 tracts analyzed werefound to be longer in human than in the chimpanzee(Figure 6) This bias was unlikely to be due to the lower accu-racy of assembly of the chimpanzee genome (36middot coverageversus 6ndash10middot for the human assembly) Not only did manylong chimpanzee RY tracts with sequencing gaps matchlong tracts in the human assembly but a pairwise plot ofall tracts also displayed exponential decay when arrangedby size However these sizes diverged from the curve fitfor 150582 tracts in the chimpanzee but only for 30582 tracts in the human (inset in Figure 6) These resultssuggest that long RY tracts may have been either more read-ily acquired or maintained in the human lineage than in thechimpanzee

Finally the humanndashchimpanzee orthologous searchesrevealed two instances (namely the PRKCB1 and ADAM18genes) in which the RY tract and flanking sequences were

Figure 6 Length comparison of RY tracts between human and chimpanzee The lengths of the tracts in chimpanzee are plotted against the lengths of the orthologoustracts in human The total RY tract lengths are given including interruptions Black dots human pure RY tracts gt250 bp in length versus orthologous tracts inchimpanzee (436 total) red dots chimpanzee pure RY tracts gt250 bp in length versus orthologous tracts in human (166 total) Inset the 582 unique humanchimpanzee pairs of RY tracts from the main panel were ranked by length Black line human RY tracts (average lengthfrac14 359 plusmn 114 bp) red line chimpanzee RYtracts (average length frac14 260 plusmn 127 bp) dotted lines curves fitted to the distributions of RY tracts

Figure 5 Comparative genomics of RY tracts Dot-plot of the R-tract and 5 kbof flanking sequence in the human CD99L2 gene with the orthologous genefrom chimpanzee

Nucleic Acids Research 2006 Vol 34 No 9 2671

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

Triplex DNA requires homo(purinepyrimidine) sequences(RY tracts) and is stabilized by Hoogsteen hydrogen bondsbetween the purines in the WatsonndashCrick duplex and a thirdstrand in the major groove which may be composed of eitherpyrimidines bound in the parallel orientation (YRY triplexes)or purines (RRY triplexes) bound in the antiparallel orienta-tion (13617ndash19) Specific interactions consist of T-AT andC+-GC triads for YRY triplexes and G-GC and A-ATtriads for RRY triplexes hence mirror repeat symmetrieswithin RY tracts yield fully paired and stable triplexesThe YRY triplexes are additionally stabilized by cytosineprotonation at N3 a process that requires low pH innucleotides but which occurs cooperatively and at neutralpH in clustered cytosines in a long polynucleotide chain(20ndash22) Finally the third strand may be provided by thefolding back of a single RY tract with mirror repeat symme-try (intramolecular triplex) by the interaction between twoRY tracts separated by some distance on the same or ontwo different DNA molecules (intermolecular triplex) or bya single-stranded oligonucleotide composed of either DNAor RNA (61923)

RY tracts are genetically unstable in experimental modelsystems Whereas numerous biological functions have beenattributed to triplexes including the blockage of DNA rep-lication (24ndash26) and the interference with transcription(2627) recent studies have provided revealing insightsFirst studies conducted in live mice mammalian cell culturesand Escherichia coli indicate that RY tracts are highly muta-genic (28ndash30) induce DSBs (3031) and are frequently foundat the breakpoint sites of gross deletions and other rearrange-ments (931) Indeed three types of biochemical and geneticstudies have shown that these genomic instabilities were dueto the non-B conformations adopted by the RY tracts andnot to the tracts in their orthodox right-handed B-form (32)Second long GAATTC repeats in the first intron of the fra-taxin (FXN) gene adopt several structures including triplexand sticky DNA which inhibit transcription of the genethus reducing the expression of frataxin (263334) Hencetriplexes may be involved in the etiology of Friedreichrsquosataxia Third triplexes may play critical roles in the bcl-2 major breakpoint region with respect to the RAG-dependentt(1418) translocation associated with follicular lymphomas(35ndash37) Fourth RY tracts elicit a biological response inthe context of mammalian chromatin also in the absence ofperfect mirror repeat symmetry and at moderate-to-shortlengths (20 bp or less) (35ndash38) indicating that the energeticbarrier associated with the duplex-to-triplex structural transi-tions may be easily overcome

Long (several hundred base pairs) RY tracts in the humangenome have been known for gt20 years (6173940) andprevious limited studies suggested their abundance in mam-malian genomes (41ndash43) To date no genome-wide querieswhich might be informative with respect to the potentialfunction(s) of these sequences have been reported Knowl-edge about the size and location of non-B DNA conforma-tions in vertebrate genomes would be expected to givecritical clues as to their biological functions The human gen-ome has so far only been surveyed for IR sequences (44)which can form cruciforms Warburton et al (44) reportedthat the large IR may have a role in testis gene expressionand genome integrity

Herein we have applied a data-mining approach to con-duct the first systematic search for the longest uninterruptedRY tracts in the human genome and tested specific hypothe-ses by employing experimental methodologies in silico Weshow that these sequences cluster specifically in the pseu-doautosomal region of the sex chromosomes and are foundpredominantly in genes that are highly expressed in thebrain These determinations along with comparative analysesof the mouse and chimpanzee genomes indicate that longRY tracts constitute mutational hotspots and are likely tohave played a key role in genome plasticity and evolution

MATERIALS AND METHODS

Computer searches

All RY tracts were found using the program PTRfinder (45)The algorithm first maps the DNA sequence to R or Y forpurine and pyrimidine respectively Then the programlocates tandem repeats with this reduced alphabet andgiven the minimum lengths input by the user The resultswere then imported into the Genomic Resource InformationDatabase (GRID httpgridabccncifcrfgov) (J R CollinsR M Stephens and J Shan manuscript in preparation) forweb access and for generating queries to correlate RY tractswith gene location and overlap Genomic build parametersdescribing tracts of pure R or Y were extracted from theGRID database using the dataset from the Human GenomeBrowser at the University of California Santa Cruz(UCSC) which was based on the hg17 NCBI Build 35May 2004 assembly The following SQL query was used toretrieve all tracts of type R or Y with length gt250 bplsquoSELECT from lsquopupyrsquo WHERE lsquolenrsquogtfrac14250 ANDlength(lsquotypersquo)frac141 ORDER BY lsquochromrsquo lsquolenrsquorsquo This queryretrieved 818 records 814 with fully determined coordinateswhich were imported into MSExcel spreadsheets An addi-tional table column with 814 links to the UCSC Human Gen-ome Browser was created using a URL pattern to link eachrecord to the exact genomic position By opening the linkseach record was visually checked for being within or betweengene sequences Thus 241 of the 814 tracts were found tooccur within annotated genes Of the 228 non-redundantRY tract-containing genes information was gathered for156 genes either through OMIM or PubMed (httpwwwncbinlmnihgov)

In order to compare the 814 tracts and their surroundingsequences in human with their orthologous counterparts inchimpanzee (Pan troglodytes) five PERL (Practical Extrac-tion and Reporting Language) scripts were written A stan-dalone BLAT (Blast-like Alignment Tool) program wasdownloaded from httpwwwsoeucscedu~kentsrc andinstalled on a computer with Linux FC3 The human andchimpanzee (NCBI Build 1 version 1 November 2003) gen-omes were downloaded from the UCSC website In additionthe free package CLUSTAL W 183 and the EMBOSSapplication DOTMATCHER (Ian Longden Sanger InstituteCambridge UK) which is under GNU license (httpwwwgnuorg) were downloaded and installed

The first PERL script lsquo1_queryTOfastaplrsquo processed the814-record table containing the repeat data Coordinateswere transformed so as to include 2500 bp of flanking

2664 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

sequences from both sides of each tract For each recordthe respective DNA sequence was extracted from thelocal human genome files and added to a file in FASTAformat containing all records A second PERL scriptlsquo2_saBLATfastaVSchromosomesplrsquo automated the BLASTsearch of every record against the local chimpanzee genomeTo speed up this process a previously made chimpanzeelsquo11oocrsquo R or Y tracts file was used (for details see BLATFAQs and tutorials at the website previously cited) The out-put was a directory with BLAT-hits results in chimpanzee inPLSX table format one file for each record A third PERLscript lsquoselectFROMfileplrsquo processed the PLSX-files directorypicking up the best score from each file and merging all thebest-scores into a single SELECT file which also contained814 records A fourth PERL script lsquo4_plsxTOfastaCHRplrsquoprocessed the 814-record SELECT table in the same waythat the first script processed the GRID-database table Itextracted from the local chimpanzee genome each corre-sponding segment of DNA sequence and added them to aFASTA file The product of running the first through fourthscripts are two multiple entry DNA FASTA files one withhuman queries and the other with chimpanzee matchesA fifth script lsquo5_html_uplrsquo compared all query and matchsequences from the two files above It also ran CLUSTAL Wand DOTMATCHER for each of the pairs of records Sub-sequently it created an HTML document presenting theresults For ease in troubleshooting the job was performedin several steps using separated scripts even though itcould readily be integrated into one The results for theRY tracts present within genes have been posted at thefollowing website httphomencifcrfgovccrlgddean_labpure_R_or_Y_Comparison

Similarly 5 kb human DNA fragments containing the RYtracts centrally were used as queries in BLAT (httpgenomeucsceducgi-binhgBlat) searches on the mouse (Mus muscu-lus Mm5 NCBI build 33 May 2004) macaque (Macacamulatta NCBI Mmul_01 January 2005) and dog (Canisfamiliaris NCBI v10 July 2004) genomes Genomic inter-spersed repeats were identified by RepeatMasker (httpwwwrepeatmaskerorg)

Tissue expression of RY tract-containing genes

Tissue gene expression data were downloaded from theGenomic Institute of the Novartis Research Foundation(GNF httpwebgnforgindexshtml) and comprised theAffymetrix U133A chip (22 000 probe sets) plus a customGNF Affymetrix chip with 11 000 probe sets analyzed on79 human tissues and cell lines Approximately 17 000known human genes were mapped to these probe sets andtheir relative expression levels were analyzed in the 79 tissuetypes to identify genes preferentially expressed in each tissuePreferential expressed genes are defined as follows theexpression levels of a given gene across all tissues were trans-formed to a Z-score and genes with a Z-score gt 1 (ie gt1 SDabove the mean) in a given tissue were classified as highlyexpressed in that tissue The Z-score was averaged whenmore than one probe set hybridized to the same gene andfor duplicate tissues A total of 1 390 000 Z-score valuesfor all genes and all tissues were obtained 10 of whichqualified as highly expressed P-values were obtained by

comparing the fraction of the 17 000 genes highlyexpressed in a given tissue with the fraction observed forthe set of genes containing either the pure RY tracts gt250bp or the pure RY tracts gt100 bp in length assuming abinomial distribution P-values were corrected (Bonferroni)for multiple comparison of 79 tissues

Functional category enrichment analysis and creation ofgene-term association networks (GTANs)

Functional category enrichment analyses were performedusing a software tool WholePathwayScope (WPS) (46)developed at the ABCC (NCI-Frederick MD) Briefly thisanalysis is based on Fisher Exact Test for 2 middot 2 contingencytables (gene list versus functional category) to estimate andrank the statistical significance of the enrichment of func-tional categories within a given system (GO terms BioCartapathways KEGG pathways gene-disease associationsprotein interaction partners and protein families) for genesharboring gt250 pure RY tracts (RY250) or gt100 pureRY tracts (RY100) The databases used which were col-lected in the WPS database were as follows the GO termsand gene-GO term association tables were downloaded fromthe Gene Ontology Consortium (httpwwwgeneontologyorg) the BioCarta pathways were kindly provided by theCGAP group (Cancer Genome Anatomy Project httpcgapncinihgov) which originated from the Bio Carta path-way collections (httpwwwbiocartacomgenesallPathwaysasp) the KEGG pathways were downloaded from the KEGGpathway collections (Kyoto Encyclopedia of Genes and Gen-omes httpwwwgenomeadjpkegg) the gene disease asso-ciations were downloaded from the Genetic AssociationDatabase (httpgeneticassociationdbnihgov) informationon Protein Interaction Partners and Protein families (Pfam)information was kindly provided by the DAVID (Databasefor Annotation Visualization and Integrated Discovery)group (httpdavidniaidnihgov) Gene-Term AssociationNetworks (GTANs) were created within WPS such that forany given gene in a gene list (RY100 or RY250) its associ-ated term(s) was sought in the WPS database The pairwisegene-term relationships were represented as a graphical lay-out of a Gene-Term Association Network within WPS inwhich any gene and its associated term were linked with anedge to indicate the gene-term association relationship

Gene size

The sizes for annotated genes were obtained from the lengthsof pre-mRNA transcripts which were downloaded fromthe National Center for Biotechnology Information (NCBI)website (ftpftpncbinihgovgenomesH_sapiensRNA)The set of 1377 genes used to determine the size distributionof genes expressed in brain was obtained from the tissue distri-bution analysis with genes with a Z-score gt 1 in brain tissuesbeing selected In most cases the SigmaPlot 2002 programversion 802 (SPSS Chicago IL) was used to represent theresults graphically and to determine the best fit to the data

Clustering of RY tracts

The order statistics r-scans as described in Karlin andMacken (47) were used to detect significant clustering

Nucleic Acids Research 2006 Vol 34 No 9 2665

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

of the RY tracts observed along each chromosome bycomparing their distribution with that of a uniform Poissondistribution We assume that Xi are the distances betweenadjacent RY tracts (n tracts in total) along an individualchromosome that are not necessarily independently and ident-ically distributed (iid) Note that all points are mapped into a[01] interval Denote by

YethrTHORNi frac14

Xithornr1

jfrac141

Xjsbquo i frac14 1sbquo sbquon r thorn 1sbquor n

the distance from the j-th RY tract to its r-th nearest neigh-bor There is an order statistic associated with the sequence ofpartial sums

YethrTHORN1 sbquoY

ethrTHORN2 sbquo sbquoY

ethrTHORNnrthorn1

and a minimum defined as

methrTHORN frac14 Y1 frac14 min fY

ethrTHORN1 sbquoY

ethrTHORN2 sbquo sbquoY

ethrTHORNnrthorn1g

If we assume that distances Xi are iid then according toKarlin and Macken (47)

Pr methrTHORN ltx

n1thorneth1rTHORN

1 exp eth lTHORNsbquo l frac14 xr

rsbquor gt 1

We declare significant clustering when the observed mini-mum for the limit distribution has lt001 (or lt005) probabil-ity of occurring by chance and therefore

methrTHORN lt1

n

reth ln 099THORNn

1r

Several RY tract clusters were detected (after correctingfor adjacent tracts separated by single nucleotide interrup-tions) by testing the first minimum of r-scans (where r frac141 16)

Complexity analysis

Complexity analysis as devised by Gusev et al (48) wasemployed in a search for DR IR and mirror repeats (MR)in the pseudoautosomal region (PAR1) and the downstream44 Mb region (After_PAR) in the 5 kb sequences flankingthe RY tracts in the PRKCB1 and ADAM18 genes in the30-untranslated region (30-UTR) of the DISC1 gene and inthe comparative analyses of the RY tracts in the mammalianTIAM1 gene

RESULTS

RY tracts within genes

The human genome was screened for the largest RY tractswith the expectation that these could shed light on theirpotential biological role(s) The total number of tracts equal-ing or exceeding an arbitrary length cutoff of 250 pure R(adenine A and guanine G) or Y (cytosine C and thymineT) bases was 814 Approximately 30 (241814) werelocated within annotated genes (228 some genes containedmore than one tract Supplementary Table 1) all withinintrons The longest tract (1303 bp) was present in the

CENTA1 gene on chromosome 7 which was the only pureintragenic sequence to exceed 1 kb Tracts were then ana-lyzed to determine the RY tract length if occasional inter-ruptions (mostly single nt changes) were ignored The25 kb PKD1 RY tract (49) which has been instrumentalin the characterization of the mutational properties of RYsequences (92532) was the third longest preceded onlyby a 31 kb tract in the DKFZp434G0625 gene and a tractof nearly 4 kb in the LOC348094 gene (both encoding prod-ucts of unknown function) Hence the PKD1 gene possessesthe longest gene-associated RY tract in the human genomefor genes of known function A double logarithmic plot ofthese tracts arranged in order of size exhibited a linear rela-tionship (y frac14 3499 ndash 0458x r2 frac14 099 data not shown)indicating an exponential decay distribution This distributionsuggests that selection against or in favor of any RY tractlength is not exercised to a detectable level within the exist-ing envelope

Complexity analysis was employed to identify MR sepa-rated by any distance within these RY tracts The averagefractions of non-redundant MR elements (with no interrup-tions) per RY tract were 16 4 and 2 for MR lengths of10 20 and 30 bp respectively In addition all RY tractshad at least one 10 bp-long MR element Hence based onthe work conducted on the PKD1 and other RY tractsin vivo (928303235375051) we conclude that most ifnot all RY tracts with gt250 uninterrupted base pairs havethe potential to form intramolecular andor intermolecular tri-plexes in vivo However future work will be required toexperimentally evaluate triplex formation by these RYtracts

To determine whether there was preferential associationwithin certain categories of genes we compared the set ofRY tract-containing genes with that of the human (refer-ence) dataset in proteomic databases (Supplementary Table 2)For the Gene Ontology (GO) Molecular Function eightterms exceeded a P-value of 104 and included seven termsfor channel activities and one term for glutamate receptoractivity consistent with strong enrichment for genes encod-ing transmembrane proteins in the brain (SupplementaryTable 2A) The GO Biological Process analysis revealed13 terms which exceeded P-values of 104 and includedthree terms for cell adhesion and cell communication fourfor neuronal function and three for ion transport (Supplemen-tary Table 2B) indicative of a preferential distribution withingenes involved in specialized functions at the cell membraneThe GO Cellular Component analysis (SupplementaryTable 2C) yielded four terms that exceeded P-values of104 and which were associated with localization to the cellmembrane

The fibronectin type III C-cadherin and a-catenindomains present in many cell surface receptors and celladhesion molecules represented the most highly enrichedcategories (P-values 104) in the Protein Families (Pfam)and Protein Information Resource (PIR) databases (Supple-mentary Tables 2D and 2E) whereas eight terms includingthe large neuroactive ligand interaction pathway which com-prises an array of neuronal receptors implicated in neuropep-tide and small molecule signaling pathways were found to beenriched in the more limited BioCarta and Kyoto Encyclope-dia of Genes and Genomes databases (Supplementary

2666 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

Tables 2F and 2G) Analysis of the Protein InteractionPartners database (Supplementary Table 2H) indicated3491 enrichments led by DLG4 (psd-95 P 8 middot 104) acomponent of the post-synaptic density structure involvedin receptor clustering Finally analysis of the DiseaseAssociation database indicated that the most significantenrichment (P 5 middot 103) was for candidate genes forschizophrenia susceptibility (Supplementary Table 2I) Insummary the longest RY tracts were non-randomly dis-tributed and were disproportionately associated withgenes whose products are localized to the plasma membraneand which perform cell communication and transportfunctions

Distinct gene categories

The gt250 bp RY tract-containing genes represented 1of the annotated human gene dataset We therefore soughtto determine whether lowering the RY length thresholdwould yield a larger set of genes with the same non-randomdistribution profile A search for genes with known functionscontaining pure RY tracts gt100 bp in length yielded a totalof 2886 hits and a non-redundant gene set of 1957 Com-parison of the distribution of these RY tract-containinggenes with that of the reference dataset (SupplementaryTable 3) revealed that the terms most highly enriched werecommon to those identified for the genes containing thegt250 RY tracts (Table 1) This correlation was particularlystriking for gene products involved in signal transduction

pathways at synapses which also showed strong associa-tions with susceptibility to schizophrenia (SupplementaryFigures 1 and 2 Supplementary Table 4 and SupplementaryText)

Since P-values are sensitive to sample size we next deter-mined whether the terms were more enriched in gt250 orgt100 bp RY tract-containing genes based on lsquofold-enrichmentrsquo Consistently greater enrichments were notedfor the gt250 bp RY tract-containing genes (Table 1)Hence as the RY tract length increases so does the proba-bility of their association with genes involved in specificfunctions at the plasma membrane

A second set of terms was found to be highly enriched inthe gt100 bp RY tract-containing genes but not in the gt250bp RY tract-containing genes These included terms contain-ing transferase and kinase activities (GO_MF) and proteinreceptor phosphorylation of genes implicated in developmentand morphogenesis (GO_BP) implying an enrichment ingenes involved in signal transduction pathways (Supplemen-tary Figure 3 and Supplementary Text)

Hence we conclude that the longest RY tracts in thehuman genome are distributed between two main pools ofgenes in a length-dependent fashion The first group com-prising shorter length tracts (100 bp lt RY lt 250 bp) co-localized preferentially with genes involved in signal trans-duction pathways associated with development whereas asecond group comprising longer tracts (RY gt 250 bp)co-localized with genes mostly involved in ion transportcell adhesion and neurogenesis Since several terms exceeded

Table 1 Major enriched categories for genes with gt 250 bp and gt 100 bp RY

Combined Rank Category (termmdashpathway) P-value Fold enrichmentgt250 RY gt100 RY gt250 RY gt100 RY

GO molecuar function10 Ion channel activity 195E05 592E09 44 2212 Protein binding 314E03 625E15 17 1620 Glutamate receptor activity 611E04 192E07 103 43

GO biological process5 Cell adhesion 111E04 336E12 34 227 Cell communication 219E04 524E15 16 14

15 Transmission of nerve impulse 183E04 524E08 51 25GO cellular component

2 Membrane 278E06 246E14 16 1315 Synapse 218E02 769E05 50 2819 Extracellular matrix 496E02 133E08 23 22

Pfam family2 Fibronectin type III 284E05 143E12 51 28

15 C-cadherin 776E05 314E05 170 4317 C2 117E04 395E05 54 21

PIR family3 a-catenin 355E04 120E03 604 94

11 Cadherin 181E02 404E04 95 40KEGG pathway

5 Neuroactive ligandndashreceptor interaction 706E03 252E03 29 156 Phosphatidylinositol signaling system 230E02 187E05 48 267 Choleramdashinfection 126E02 339E03 60 22

Protein interaction partners22 DLG4 798E04 206E03 451 6123 RAC1 834E03 978E04 71 24

Disease association4 Schizophrenia 523E03 233E03 75 26

Combined Rank sum of the two categories ranks from Supplementary Tables 2 and 3 categories with one entry were excluded when more than one categorywith largely redundant gene entries was present only one was chosen

Nucleic Acids Research 2006 Vol 34 No 9 2667

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

P-values of 1010 and the different databases (particularly thelarge GO_MF GO_BP GO_CC and Pfam Family) were inagreement in identifying terms that contained the samegenes the conclusions are strongly supported by the data

Tissue gene expression

To determine whether the gt250 bp RY tract-containinggenes (250-set) and the gt100 bp RY tract-containinggenes (100-set) expressed a greater proportion of transcriptsin specific tissues as compared with the reference dataset(All-set) we examined the Affymetrix U133A tissue expres-sion dataset derived from 79 tissues from the Genomic Insti-tute of the Novartis Research Foundation (GNF) For a giventissue the fraction of highly expressed genes in the 250-set(and 100-set) was then compared with the fraction of highlyexpressed genes in the reference dataset (All-set)

The fractions of highly expressed genes containing RYtracts (250-set and 100-set) were significantly greater thanthe corresponding fractions of the All-set in all brain tissuesexamined in the atrioventricular node of the fetus and theuterus (Figure 1 P-values ranged from 4 middot 102 to lt1 middot1013 for the 100-set and from 2 middot 102 to 6 middot 105 forthe 250-set) These fractions were also consistently moreenriched in genes with longer RY tracts (250-set) thanwith shorter ones (100-set) regardless of the P-values Insummary these data are compatible either with the viewthat RY tracts with lengths gt100 bp (including gt250) co-localize preferentially with genes highly expressed in thebrain or that these tracts may perform a role in mediatinggene transcription in brain tissues

Gene size distributions

Some of the genes expressed in human brain tissues such ascontactin-associated protein-like 2 (CNTNAP2) dystrophin(DMD) and ataxin-2 binding protein (A2BP1) are amongthe largest known each spanning gt15 Mb of genomicDNA We therefore posed the question as to whether the pref-erential finding of RY tracts in brain-expressed genes mightbe associated with larger gene size The size distribution forthe annotated human gene population followed a Gaussiandistribution when the logarithms of gene lengths were ana-lyzed (Figure 2 gray) According to this distribution theaverage gene length is 18 kb (Figure 2mdashmode and meancoincide for Gaussian distributions) In contrast the set of1377 genes highly expressed in the brain yielded a Gaussiandistribution peaking at 43 kb (black) Thus human genespredominantly expressed in brain tissues are on averagemore than twice as long as those from the total gene popula-tion when their log-normal distributions are compared Thisnotwithstanding the size distributions of the RY tract-containing genes were of disproportionately greater sizewith the gt100 bp RY tract-containing gene set peaking at108 kb (green) and the gt250 bp RY tract-containing geneset peaking at 192 kb (red) In addition the gt250 bp RYtract-containing genes displayed a pronounced negativeskewness and hence a non-Gaussian behavior We thereforeconclude that larger human genes also tend to host longerRY tracts

Next we investigated whether larger genes also containeda greater number of RY tracts By analyzing the gt100 bpRY tract-containing gene set for all pure RY tracts gt50

Figure 1 Tissue distribution of RY tract-containing genes The percentage increase in the fraction of RY-containing genes (100-set and 250-set) highly expressed(Z-score gt1) in a given tissue relative to the fraction of all genes highly expressed in the same tissue is displayed on the x-axis Asterisks significantly enriched tissuesas determined by binomial probability after a Bonferroni multiple comparison correction Significance P lt 107 107 lt P lt 005 - not significant

2668 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

bp in length 33 genes were found to harbor between 20 and57 RY tracts (Supplementary Table 5) The size of thesegenes ranged from 38 kb to 23 Mb with a median of800 kb and an average of 910 kb indicating that largergenes also tend to host more RY tracts as predicted

The gene size distributions of Figure 2 also raise the ques-tion as to whether the percent increases in the proportions ofhighly expressed RY tract-containing genes in the brain(Figure 1) could be accounted for by the overall greaterlength of the genes expressed in this organ Consideringthat the distributions PAll PBrain PRY gt 100 and PRY gt 250

(Figure 2 gray black green and red lines respectively) areinterdependent and log-normal (with the exception of PRY gt

250) the predicted ratio increase may be calculated from thefollowing relationshipRethPRY middot PBrainTHORNdxRethPRY middot PAllTHORNdx

sbquo

where PRY is either PRYgt100 or PRYgt250 using the means andstandard deviations that defined the respective curvesThe values obtained for the gt100 and gt250 bp RYtract-containing gene sets were 130 and 132 respectivelyfor the integrals bound by the log gene sizes of 0 and 8(which extends beyond all gene sizes) yielding a predictedpercent increase of 30 For the 250-set all percentageincreases were greater than this value of 30 for the 13brain tissues examined (Figure 1) About 23 of the 100-setpercentage increases in brain tissues were gt30 (up to75) whereas 13 were close to 30 In summary thesedata indicate that the number of RY tract-containing genes(but not the number of RY tracts per kilobase pair)expressed in the brain is greater than expected based ontheir longer lengths

The pseudoautosomal region

The total number of pure RY tracts gt250 bp in the humangenome was 814 We evaluated whether these tracts were

evenly dispersed following a Poisson-like distributionthroughout chromosomes or whether instead they were clus-tered in specific regions By testing each chromosome sepa-rately four small clusters were noted on chromosomes 5 6and 12 each involving 2ndash5 tracts (Supplementary Figure 4)In contrast two large clusters were identified on the X and Ychromosomes of 17 and 11 tracts respectively starting fromthe telomeric p-arm and extending for 62 and 26 Mbrespectively The sequence of the first 26 Mb (pseudoautoso-mal region or PAR1) is identical on both sex chromosomesand is essential for homologous recombination during malemeiosis and chromosome pairing (5253) This promptedspeculation that the RY tracts might play a role in PAR1function by virtue of their structural properties perhaps inconcert with other non-B DNA-forming sequences such asIR DR and MR

A search for pairs of DR IR and MR with an arbitrarylength cut-off of gt62 bp in the first 70 Mb of the X chromo-some revealed a significant (c2 test) over-representation ofDR of both gt62 bp and gt250 bp and IR of gt62 bp in thePAR1 region by comparison with the 44 Mb region that fol-lows (After_PAR P lt 00001) In contrast MR of gt62 bpand IR of gt250 bp were not significantly over-represented(Figure 3 and Supplementary Text) In addition the distribu-tion of MR was characterized by a preponderance of(GAAATTTC)n motifs in PAR1 (1011 in PAR1 and 815in After-PAR) We surveyed the sequence composition ofall DR and IR of gt180 bp to determine whether they wereuniquely represented in the PAR1 region or whether they cor-responded to interspersed elements distributed genome-wideNo significant homology was found to any of the 201 repeatsin PAR1 In contrast 1121 repeats in the After-PAR regionexhibited extensive homologies with other chromosomalsites indeed 911 tracts were identified as LINE1 (79)LINE1LTR (19) or LTR (19) elements Additional largesegments of repetitive DNA were also identified in PAR1Seven (gt1 kb) were closely examined and found to compriseorderly arrays of tandem repeat blocks (TRB) containingmultiple sequence motifs with few interruptions (Supplemen-tary Figure 6) In summary these data indicated that the clus-tered RY tracts gt250 bp form part of a larger family ofrepetitive elements mostly DR of unique composition andshort IR that densely populate the PAR1 region

The (GAAATTTC)n repeats

The recurrence of the (GAAATTTC)n repeat within PAR1was intriguing since this sequence possesses the RY mirrorsymmetry that facilitates triplex-formation (1361718) Inaddition its similarity to the (GAATTC)n motifs whichform extraordinarily stable non-B structures (26345455)raised the question as to whether these structural propertiesmight have been functionally exploited throughout the gen-ome We addressed this issue by searching for RY tractswith mirror symmetry gt30 bp in length (sufficient to yieldstable triplexes) and comparing their frequencies with thoseof all microsatellites of comparable repeat units and totallengths The search for microsatellites with unit lengthsfrom 1 to 29 nt yielded a bimodal distribution with a firstpeak comprising the mono- to penta-nucleotide repeats(Supplementary Figure 5) and a second peak composed of

Figure 2 Relationship between gene size and the presence of RY tracts Thefractions of genes with log gene size falling within 02 (025 for gt100 bp RYtract-containing genes and 03 for gt250 bp RY tract-containing genes) log-intervals plotted as a function of their mean values Gray 22799 human genesblack 1377 brain-specific human genes green 1891 gt100 bp RY tract-containing human genes red 200gt250 bp RY tract-containing human genes

Nucleic Acids Research 2006 Vol 34 No 9 2669

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

gt15 nt unit lengths most of which could be decomposed intoshorter repeat units displaying an evenodd length asymme-try as previously noted for the short (1ndash6mer) repeats (56)Dinucleotides were the most abundant (55 196 copies)followed by tetra- (29 590 copies) mono- (16 686 copies)penta- (8699) and tri-nucleotides (6711) The distribution ofRY tracts revealed an abundance of (AT)n (16 679 copies)over (GC)n (seven copies Figure 4) At present we do notunderstand this paucity of (GC)n runs The distributionswere followed by (GAAATTTC)n (3217 copies) (GATC)n

(2390 copies) and (GGAATTCC)n (2200 copies) All othercombinations of G+A repeats contained lt400 copies There-fore given that poly(A) sequences have the unique propertyof decreasing triplex stability with increasing length (57ndash59)these analyses (Figure 4 and Supplementary Figure 5) indi-cated that the (GAAATTTC)n repeats were over-representedwith respect to the total microsatellite population and wereespecially abundant among the triplex-forming motifs withmirror symmetry However the role if any of such tractsand the underlying ability to form intra- or intermoleculartriplexes remains to be determined

Evolutionary comparisons

To determine whether long pure RY tracts (gt250 bp) areunique to human we queried the chimpanzee dog mouserat and chicken genomes Such RY tracts lengths werefound to be common for all the mammalianavian speciesexamined (Supplementary Table 6) In addition the numberof pure sequences gt50 bp varied from 8000 in chicken togt100 000 in murids attesting to their abundance A search ofthe mouse genome for genes with gt250 bp RY tractsreturned a set of 827 non-redundant entries and indicatedthat the main functional enrichment categories (Supplemen-tary Table 7) largely overlapped those of the gt100 bp RYtract-containing human gene set (Supplementary Table 3)This indicates that RY tracts have been retained within thesame gene families since humanndashmouse divergence 80 mil-lion years ago (Mya) and also that human membrane-associated genes (Supplementary Table 2) highly expressedin brain have maintained longer RY tract lengths thanother gene families

This notwithstanding little or no homology was foundbetween human and mouse orthologous genes when 5 kbfragments flanking the gt250 bp RY tracts of the 228human genes (Supplementary Table 1) were compared Simi-larly analysis of the human and mouse orthologous geneswhich contained gt250 bp RY tracts in both species (24total) revealed little conservation in terms of either thenumber or location of the tracts (Supplementary Table 8)Thus whereas RY tracts have been retained within genefamilies their lengths and locations within individual geneshave diverged substantially

Humanndashchimpanzee orthologous RY tracts

To determine more accurately the extent of sequence conser-vation the human intragenic RY tracts gt250 bp togetherwith plusmn25 kb of flanking sequence were compared withtheir orthologous sites in the chimpanzee genome Thesespecies diverged 5ndash8 Mya from their last common ancestor(LCA) and still share 98 sequence identity Orthologoussequences were identified for most of the 239 tracts(Supplementary Text) A typical dot-plot depicting a com-parison of 5 kb of the human CD99L2 gene with its chim-panzee counterpart is displayed in Figure 5 Sequence

Figure 4 Number of Triplex-forming sequences with mirror symmetry in thehuman genome Gray bars total numbers of uninterrupted RY tracts withmirror symmetry gt30 bp in length

Figure 3 Repetitive elements in the pseudoautosomal region The vertical lines represent the locations of repetitive elements in the first 7 Mbp of the human Xchromosome containing the PAR1 region RY clustered RY repeats from Supplementary Figure 4 MR mirror repeatsgt62 bp DRgt250 direct repeatsgt250 bpIRgt250 inverted repeatsgt250 bp DRgt62 direct repeatsgt62 bp butlt250 bp IRgt62 inverted repeatsgt62 bp butlt250 bp TRBgt1kb sequences containing tandemrepeat blocks gt948 pure with a total length gt1kb red annotated genes XG (in blue) gene spanning the PAR1 boundary

2670 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

identity is evident throughout the region with the exception ofthe R tract which although present in both species is some-what shorter in chimpanzee Of all the 142 RY sequencepairs analyzed most displayed the expected 98 homology

in the RY tract-flanking sequences but none showedsequence conservation within the tracts We conclude thatthe RY tracts have mutated at a much faster rate than thesurrounding sequences Subsequent analyses (SupplementaryFigure 7) provided evidence for a combination of slippedmispairing recombination-mediated duplications nucleotidesubstitutions and possibly also gene conversion in mediatingRY tract divergence

To determine whether RY tracts manifest a length bias inhominids we compared the total length of all pure tractsgt250 bp in both species with their orthologous counterparts(Supplementary Text) 80 of the 582 tracts analyzed werefound to be longer in human than in the chimpanzee(Figure 6) This bias was unlikely to be due to the lower accu-racy of assembly of the chimpanzee genome (36middot coverageversus 6ndash10middot for the human assembly) Not only did manylong chimpanzee RY tracts with sequencing gaps matchlong tracts in the human assembly but a pairwise plot ofall tracts also displayed exponential decay when arrangedby size However these sizes diverged from the curve fitfor 150582 tracts in the chimpanzee but only for 30582 tracts in the human (inset in Figure 6) These resultssuggest that long RY tracts may have been either more read-ily acquired or maintained in the human lineage than in thechimpanzee

Finally the humanndashchimpanzee orthologous searchesrevealed two instances (namely the PRKCB1 and ADAM18genes) in which the RY tract and flanking sequences were

Figure 6 Length comparison of RY tracts between human and chimpanzee The lengths of the tracts in chimpanzee are plotted against the lengths of the orthologoustracts in human The total RY tract lengths are given including interruptions Black dots human pure RY tracts gt250 bp in length versus orthologous tracts inchimpanzee (436 total) red dots chimpanzee pure RY tracts gt250 bp in length versus orthologous tracts in human (166 total) Inset the 582 unique humanchimpanzee pairs of RY tracts from the main panel were ranked by length Black line human RY tracts (average lengthfrac14 359 plusmn 114 bp) red line chimpanzee RYtracts (average length frac14 260 plusmn 127 bp) dotted lines curves fitted to the distributions of RY tracts

Figure 5 Comparative genomics of RY tracts Dot-plot of the R-tract and 5 kbof flanking sequence in the human CD99L2 gene with the orthologous genefrom chimpanzee

Nucleic Acids Research 2006 Vol 34 No 9 2671

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

sequences from both sides of each tract For each recordthe respective DNA sequence was extracted from thelocal human genome files and added to a file in FASTAformat containing all records A second PERL scriptlsquo2_saBLATfastaVSchromosomesplrsquo automated the BLASTsearch of every record against the local chimpanzee genomeTo speed up this process a previously made chimpanzeelsquo11oocrsquo R or Y tracts file was used (for details see BLATFAQs and tutorials at the website previously cited) The out-put was a directory with BLAT-hits results in chimpanzee inPLSX table format one file for each record A third PERLscript lsquoselectFROMfileplrsquo processed the PLSX-files directorypicking up the best score from each file and merging all thebest-scores into a single SELECT file which also contained814 records A fourth PERL script lsquo4_plsxTOfastaCHRplrsquoprocessed the 814-record SELECT table in the same waythat the first script processed the GRID-database table Itextracted from the local chimpanzee genome each corre-sponding segment of DNA sequence and added them to aFASTA file The product of running the first through fourthscripts are two multiple entry DNA FASTA files one withhuman queries and the other with chimpanzee matchesA fifth script lsquo5_html_uplrsquo compared all query and matchsequences from the two files above It also ran CLUSTAL Wand DOTMATCHER for each of the pairs of records Sub-sequently it created an HTML document presenting theresults For ease in troubleshooting the job was performedin several steps using separated scripts even though itcould readily be integrated into one The results for theRY tracts present within genes have been posted at thefollowing website httphomencifcrfgovccrlgddean_labpure_R_or_Y_Comparison

Similarly 5 kb human DNA fragments containing the RYtracts centrally were used as queries in BLAT (httpgenomeucsceducgi-binhgBlat) searches on the mouse (Mus muscu-lus Mm5 NCBI build 33 May 2004) macaque (Macacamulatta NCBI Mmul_01 January 2005) and dog (Canisfamiliaris NCBI v10 July 2004) genomes Genomic inter-spersed repeats were identified by RepeatMasker (httpwwwrepeatmaskerorg)

Tissue expression of RY tract-containing genes

Tissue gene expression data were downloaded from theGenomic Institute of the Novartis Research Foundation(GNF httpwebgnforgindexshtml) and comprised theAffymetrix U133A chip (22 000 probe sets) plus a customGNF Affymetrix chip with 11 000 probe sets analyzed on79 human tissues and cell lines Approximately 17 000known human genes were mapped to these probe sets andtheir relative expression levels were analyzed in the 79 tissuetypes to identify genes preferentially expressed in each tissuePreferential expressed genes are defined as follows theexpression levels of a given gene across all tissues were trans-formed to a Z-score and genes with a Z-score gt 1 (ie gt1 SDabove the mean) in a given tissue were classified as highlyexpressed in that tissue The Z-score was averaged whenmore than one probe set hybridized to the same gene andfor duplicate tissues A total of 1 390 000 Z-score valuesfor all genes and all tissues were obtained 10 of whichqualified as highly expressed P-values were obtained by

comparing the fraction of the 17 000 genes highlyexpressed in a given tissue with the fraction observed forthe set of genes containing either the pure RY tracts gt250bp or the pure RY tracts gt100 bp in length assuming abinomial distribution P-values were corrected (Bonferroni)for multiple comparison of 79 tissues

Functional category enrichment analysis and creation ofgene-term association networks (GTANs)

Functional category enrichment analyses were performedusing a software tool WholePathwayScope (WPS) (46)developed at the ABCC (NCI-Frederick MD) Briefly thisanalysis is based on Fisher Exact Test for 2 middot 2 contingencytables (gene list versus functional category) to estimate andrank the statistical significance of the enrichment of func-tional categories within a given system (GO terms BioCartapathways KEGG pathways gene-disease associationsprotein interaction partners and protein families) for genesharboring gt250 pure RY tracts (RY250) or gt100 pureRY tracts (RY100) The databases used which were col-lected in the WPS database were as follows the GO termsand gene-GO term association tables were downloaded fromthe Gene Ontology Consortium (httpwwwgeneontologyorg) the BioCarta pathways were kindly provided by theCGAP group (Cancer Genome Anatomy Project httpcgapncinihgov) which originated from the Bio Carta path-way collections (httpwwwbiocartacomgenesallPathwaysasp) the KEGG pathways were downloaded from the KEGGpathway collections (Kyoto Encyclopedia of Genes and Gen-omes httpwwwgenomeadjpkegg) the gene disease asso-ciations were downloaded from the Genetic AssociationDatabase (httpgeneticassociationdbnihgov) informationon Protein Interaction Partners and Protein families (Pfam)information was kindly provided by the DAVID (Databasefor Annotation Visualization and Integrated Discovery)group (httpdavidniaidnihgov) Gene-Term AssociationNetworks (GTANs) were created within WPS such that forany given gene in a gene list (RY100 or RY250) its associ-ated term(s) was sought in the WPS database The pairwisegene-term relationships were represented as a graphical lay-out of a Gene-Term Association Network within WPS inwhich any gene and its associated term were linked with anedge to indicate the gene-term association relationship

Gene size

The sizes for annotated genes were obtained from the lengthsof pre-mRNA transcripts which were downloaded fromthe National Center for Biotechnology Information (NCBI)website (ftpftpncbinihgovgenomesH_sapiensRNA)The set of 1377 genes used to determine the size distributionof genes expressed in brain was obtained from the tissue distri-bution analysis with genes with a Z-score gt 1 in brain tissuesbeing selected In most cases the SigmaPlot 2002 programversion 802 (SPSS Chicago IL) was used to represent theresults graphically and to determine the best fit to the data

Clustering of RY tracts

The order statistics r-scans as described in Karlin andMacken (47) were used to detect significant clustering

Nucleic Acids Research 2006 Vol 34 No 9 2665

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

of the RY tracts observed along each chromosome bycomparing their distribution with that of a uniform Poissondistribution We assume that Xi are the distances betweenadjacent RY tracts (n tracts in total) along an individualchromosome that are not necessarily independently and ident-ically distributed (iid) Note that all points are mapped into a[01] interval Denote by

YethrTHORNi frac14

Xithornr1

jfrac141

Xjsbquo i frac14 1sbquo sbquon r thorn 1sbquor n

the distance from the j-th RY tract to its r-th nearest neigh-bor There is an order statistic associated with the sequence ofpartial sums

YethrTHORN1 sbquoY

ethrTHORN2 sbquo sbquoY

ethrTHORNnrthorn1

and a minimum defined as

methrTHORN frac14 Y1 frac14 min fY

ethrTHORN1 sbquoY

ethrTHORN2 sbquo sbquoY

ethrTHORNnrthorn1g

If we assume that distances Xi are iid then according toKarlin and Macken (47)

Pr methrTHORN ltx

n1thorneth1rTHORN

1 exp eth lTHORNsbquo l frac14 xr

rsbquor gt 1

We declare significant clustering when the observed mini-mum for the limit distribution has lt001 (or lt005) probabil-ity of occurring by chance and therefore

methrTHORN lt1

n

reth ln 099THORNn

1r

Several RY tract clusters were detected (after correctingfor adjacent tracts separated by single nucleotide interrup-tions) by testing the first minimum of r-scans (where r frac141 16)

Complexity analysis

Complexity analysis as devised by Gusev et al (48) wasemployed in a search for DR IR and mirror repeats (MR)in the pseudoautosomal region (PAR1) and the downstream44 Mb region (After_PAR) in the 5 kb sequences flankingthe RY tracts in the PRKCB1 and ADAM18 genes in the30-untranslated region (30-UTR) of the DISC1 gene and inthe comparative analyses of the RY tracts in the mammalianTIAM1 gene

RESULTS

RY tracts within genes

The human genome was screened for the largest RY tractswith the expectation that these could shed light on theirpotential biological role(s) The total number of tracts equal-ing or exceeding an arbitrary length cutoff of 250 pure R(adenine A and guanine G) or Y (cytosine C and thymineT) bases was 814 Approximately 30 (241814) werelocated within annotated genes (228 some genes containedmore than one tract Supplementary Table 1) all withinintrons The longest tract (1303 bp) was present in the

CENTA1 gene on chromosome 7 which was the only pureintragenic sequence to exceed 1 kb Tracts were then ana-lyzed to determine the RY tract length if occasional inter-ruptions (mostly single nt changes) were ignored The25 kb PKD1 RY tract (49) which has been instrumentalin the characterization of the mutational properties of RYsequences (92532) was the third longest preceded onlyby a 31 kb tract in the DKFZp434G0625 gene and a tractof nearly 4 kb in the LOC348094 gene (both encoding prod-ucts of unknown function) Hence the PKD1 gene possessesthe longest gene-associated RY tract in the human genomefor genes of known function A double logarithmic plot ofthese tracts arranged in order of size exhibited a linear rela-tionship (y frac14 3499 ndash 0458x r2 frac14 099 data not shown)indicating an exponential decay distribution This distributionsuggests that selection against or in favor of any RY tractlength is not exercised to a detectable level within the exist-ing envelope

Complexity analysis was employed to identify MR sepa-rated by any distance within these RY tracts The averagefractions of non-redundant MR elements (with no interrup-tions) per RY tract were 16 4 and 2 for MR lengths of10 20 and 30 bp respectively In addition all RY tractshad at least one 10 bp-long MR element Hence based onthe work conducted on the PKD1 and other RY tractsin vivo (928303235375051) we conclude that most ifnot all RY tracts with gt250 uninterrupted base pairs havethe potential to form intramolecular andor intermolecular tri-plexes in vivo However future work will be required toexperimentally evaluate triplex formation by these RYtracts

To determine whether there was preferential associationwithin certain categories of genes we compared the set ofRY tract-containing genes with that of the human (refer-ence) dataset in proteomic databases (Supplementary Table 2)For the Gene Ontology (GO) Molecular Function eightterms exceeded a P-value of 104 and included seven termsfor channel activities and one term for glutamate receptoractivity consistent with strong enrichment for genes encod-ing transmembrane proteins in the brain (SupplementaryTable 2A) The GO Biological Process analysis revealed13 terms which exceeded P-values of 104 and includedthree terms for cell adhesion and cell communication fourfor neuronal function and three for ion transport (Supplemen-tary Table 2B) indicative of a preferential distribution withingenes involved in specialized functions at the cell membraneThe GO Cellular Component analysis (SupplementaryTable 2C) yielded four terms that exceeded P-values of104 and which were associated with localization to the cellmembrane

The fibronectin type III C-cadherin and a-catenindomains present in many cell surface receptors and celladhesion molecules represented the most highly enrichedcategories (P-values 104) in the Protein Families (Pfam)and Protein Information Resource (PIR) databases (Supple-mentary Tables 2D and 2E) whereas eight terms includingthe large neuroactive ligand interaction pathway which com-prises an array of neuronal receptors implicated in neuropep-tide and small molecule signaling pathways were found to beenriched in the more limited BioCarta and Kyoto Encyclope-dia of Genes and Genomes databases (Supplementary

2666 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

Tables 2F and 2G) Analysis of the Protein InteractionPartners database (Supplementary Table 2H) indicated3491 enrichments led by DLG4 (psd-95 P 8 middot 104) acomponent of the post-synaptic density structure involvedin receptor clustering Finally analysis of the DiseaseAssociation database indicated that the most significantenrichment (P 5 middot 103) was for candidate genes forschizophrenia susceptibility (Supplementary Table 2I) Insummary the longest RY tracts were non-randomly dis-tributed and were disproportionately associated withgenes whose products are localized to the plasma membraneand which perform cell communication and transportfunctions

Distinct gene categories

The gt250 bp RY tract-containing genes represented 1of the annotated human gene dataset We therefore soughtto determine whether lowering the RY length thresholdwould yield a larger set of genes with the same non-randomdistribution profile A search for genes with known functionscontaining pure RY tracts gt100 bp in length yielded a totalof 2886 hits and a non-redundant gene set of 1957 Com-parison of the distribution of these RY tract-containinggenes with that of the reference dataset (SupplementaryTable 3) revealed that the terms most highly enriched werecommon to those identified for the genes containing thegt250 RY tracts (Table 1) This correlation was particularlystriking for gene products involved in signal transduction

pathways at synapses which also showed strong associa-tions with susceptibility to schizophrenia (SupplementaryFigures 1 and 2 Supplementary Table 4 and SupplementaryText)

Since P-values are sensitive to sample size we next deter-mined whether the terms were more enriched in gt250 orgt100 bp RY tract-containing genes based on lsquofold-enrichmentrsquo Consistently greater enrichments were notedfor the gt250 bp RY tract-containing genes (Table 1)Hence as the RY tract length increases so does the proba-bility of their association with genes involved in specificfunctions at the plasma membrane

A second set of terms was found to be highly enriched inthe gt100 bp RY tract-containing genes but not in the gt250bp RY tract-containing genes These included terms contain-ing transferase and kinase activities (GO_MF) and proteinreceptor phosphorylation of genes implicated in developmentand morphogenesis (GO_BP) implying an enrichment ingenes involved in signal transduction pathways (Supplemen-tary Figure 3 and Supplementary Text)

Hence we conclude that the longest RY tracts in thehuman genome are distributed between two main pools ofgenes in a length-dependent fashion The first group com-prising shorter length tracts (100 bp lt RY lt 250 bp) co-localized preferentially with genes involved in signal trans-duction pathways associated with development whereas asecond group comprising longer tracts (RY gt 250 bp)co-localized with genes mostly involved in ion transportcell adhesion and neurogenesis Since several terms exceeded

Table 1 Major enriched categories for genes with gt 250 bp and gt 100 bp RY

Combined Rank Category (termmdashpathway) P-value Fold enrichmentgt250 RY gt100 RY gt250 RY gt100 RY

GO molecuar function10 Ion channel activity 195E05 592E09 44 2212 Protein binding 314E03 625E15 17 1620 Glutamate receptor activity 611E04 192E07 103 43

GO biological process5 Cell adhesion 111E04 336E12 34 227 Cell communication 219E04 524E15 16 14

15 Transmission of nerve impulse 183E04 524E08 51 25GO cellular component

2 Membrane 278E06 246E14 16 1315 Synapse 218E02 769E05 50 2819 Extracellular matrix 496E02 133E08 23 22

Pfam family2 Fibronectin type III 284E05 143E12 51 28

15 C-cadherin 776E05 314E05 170 4317 C2 117E04 395E05 54 21

PIR family3 a-catenin 355E04 120E03 604 94

11 Cadherin 181E02 404E04 95 40KEGG pathway

5 Neuroactive ligandndashreceptor interaction 706E03 252E03 29 156 Phosphatidylinositol signaling system 230E02 187E05 48 267 Choleramdashinfection 126E02 339E03 60 22

Protein interaction partners22 DLG4 798E04 206E03 451 6123 RAC1 834E03 978E04 71 24

Disease association4 Schizophrenia 523E03 233E03 75 26

Combined Rank sum of the two categories ranks from Supplementary Tables 2 and 3 categories with one entry were excluded when more than one categorywith largely redundant gene entries was present only one was chosen

Nucleic Acids Research 2006 Vol 34 No 9 2667

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

P-values of 1010 and the different databases (particularly thelarge GO_MF GO_BP GO_CC and Pfam Family) were inagreement in identifying terms that contained the samegenes the conclusions are strongly supported by the data

Tissue gene expression

To determine whether the gt250 bp RY tract-containinggenes (250-set) and the gt100 bp RY tract-containinggenes (100-set) expressed a greater proportion of transcriptsin specific tissues as compared with the reference dataset(All-set) we examined the Affymetrix U133A tissue expres-sion dataset derived from 79 tissues from the Genomic Insti-tute of the Novartis Research Foundation (GNF) For a giventissue the fraction of highly expressed genes in the 250-set(and 100-set) was then compared with the fraction of highlyexpressed genes in the reference dataset (All-set)

The fractions of highly expressed genes containing RYtracts (250-set and 100-set) were significantly greater thanthe corresponding fractions of the All-set in all brain tissuesexamined in the atrioventricular node of the fetus and theuterus (Figure 1 P-values ranged from 4 middot 102 to lt1 middot1013 for the 100-set and from 2 middot 102 to 6 middot 105 forthe 250-set) These fractions were also consistently moreenriched in genes with longer RY tracts (250-set) thanwith shorter ones (100-set) regardless of the P-values Insummary these data are compatible either with the viewthat RY tracts with lengths gt100 bp (including gt250) co-localize preferentially with genes highly expressed in thebrain or that these tracts may perform a role in mediatinggene transcription in brain tissues

Gene size distributions

Some of the genes expressed in human brain tissues such ascontactin-associated protein-like 2 (CNTNAP2) dystrophin(DMD) and ataxin-2 binding protein (A2BP1) are amongthe largest known each spanning gt15 Mb of genomicDNA We therefore posed the question as to whether the pref-erential finding of RY tracts in brain-expressed genes mightbe associated with larger gene size The size distribution forthe annotated human gene population followed a Gaussiandistribution when the logarithms of gene lengths were ana-lyzed (Figure 2 gray) According to this distribution theaverage gene length is 18 kb (Figure 2mdashmode and meancoincide for Gaussian distributions) In contrast the set of1377 genes highly expressed in the brain yielded a Gaussiandistribution peaking at 43 kb (black) Thus human genespredominantly expressed in brain tissues are on averagemore than twice as long as those from the total gene popula-tion when their log-normal distributions are compared Thisnotwithstanding the size distributions of the RY tract-containing genes were of disproportionately greater sizewith the gt100 bp RY tract-containing gene set peaking at108 kb (green) and the gt250 bp RY tract-containing geneset peaking at 192 kb (red) In addition the gt250 bp RYtract-containing genes displayed a pronounced negativeskewness and hence a non-Gaussian behavior We thereforeconclude that larger human genes also tend to host longerRY tracts

Next we investigated whether larger genes also containeda greater number of RY tracts By analyzing the gt100 bpRY tract-containing gene set for all pure RY tracts gt50

Figure 1 Tissue distribution of RY tract-containing genes The percentage increase in the fraction of RY-containing genes (100-set and 250-set) highly expressed(Z-score gt1) in a given tissue relative to the fraction of all genes highly expressed in the same tissue is displayed on the x-axis Asterisks significantly enriched tissuesas determined by binomial probability after a Bonferroni multiple comparison correction Significance P lt 107 107 lt P lt 005 - not significant

2668 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

bp in length 33 genes were found to harbor between 20 and57 RY tracts (Supplementary Table 5) The size of thesegenes ranged from 38 kb to 23 Mb with a median of800 kb and an average of 910 kb indicating that largergenes also tend to host more RY tracts as predicted

The gene size distributions of Figure 2 also raise the ques-tion as to whether the percent increases in the proportions ofhighly expressed RY tract-containing genes in the brain(Figure 1) could be accounted for by the overall greaterlength of the genes expressed in this organ Consideringthat the distributions PAll PBrain PRY gt 100 and PRY gt 250

(Figure 2 gray black green and red lines respectively) areinterdependent and log-normal (with the exception of PRY gt

250) the predicted ratio increase may be calculated from thefollowing relationshipRethPRY middot PBrainTHORNdxRethPRY middot PAllTHORNdx

sbquo

where PRY is either PRYgt100 or PRYgt250 using the means andstandard deviations that defined the respective curvesThe values obtained for the gt100 and gt250 bp RYtract-containing gene sets were 130 and 132 respectivelyfor the integrals bound by the log gene sizes of 0 and 8(which extends beyond all gene sizes) yielding a predictedpercent increase of 30 For the 250-set all percentageincreases were greater than this value of 30 for the 13brain tissues examined (Figure 1) About 23 of the 100-setpercentage increases in brain tissues were gt30 (up to75) whereas 13 were close to 30 In summary thesedata indicate that the number of RY tract-containing genes(but not the number of RY tracts per kilobase pair)expressed in the brain is greater than expected based ontheir longer lengths

The pseudoautosomal region

The total number of pure RY tracts gt250 bp in the humangenome was 814 We evaluated whether these tracts were

evenly dispersed following a Poisson-like distributionthroughout chromosomes or whether instead they were clus-tered in specific regions By testing each chromosome sepa-rately four small clusters were noted on chromosomes 5 6and 12 each involving 2ndash5 tracts (Supplementary Figure 4)In contrast two large clusters were identified on the X and Ychromosomes of 17 and 11 tracts respectively starting fromthe telomeric p-arm and extending for 62 and 26 Mbrespectively The sequence of the first 26 Mb (pseudoautoso-mal region or PAR1) is identical on both sex chromosomesand is essential for homologous recombination during malemeiosis and chromosome pairing (5253) This promptedspeculation that the RY tracts might play a role in PAR1function by virtue of their structural properties perhaps inconcert with other non-B DNA-forming sequences such asIR DR and MR

A search for pairs of DR IR and MR with an arbitrarylength cut-off of gt62 bp in the first 70 Mb of the X chromo-some revealed a significant (c2 test) over-representation ofDR of both gt62 bp and gt250 bp and IR of gt62 bp in thePAR1 region by comparison with the 44 Mb region that fol-lows (After_PAR P lt 00001) In contrast MR of gt62 bpand IR of gt250 bp were not significantly over-represented(Figure 3 and Supplementary Text) In addition the distribu-tion of MR was characterized by a preponderance of(GAAATTTC)n motifs in PAR1 (1011 in PAR1 and 815in After-PAR) We surveyed the sequence composition ofall DR and IR of gt180 bp to determine whether they wereuniquely represented in the PAR1 region or whether they cor-responded to interspersed elements distributed genome-wideNo significant homology was found to any of the 201 repeatsin PAR1 In contrast 1121 repeats in the After-PAR regionexhibited extensive homologies with other chromosomalsites indeed 911 tracts were identified as LINE1 (79)LINE1LTR (19) or LTR (19) elements Additional largesegments of repetitive DNA were also identified in PAR1Seven (gt1 kb) were closely examined and found to compriseorderly arrays of tandem repeat blocks (TRB) containingmultiple sequence motifs with few interruptions (Supplemen-tary Figure 6) In summary these data indicated that the clus-tered RY tracts gt250 bp form part of a larger family ofrepetitive elements mostly DR of unique composition andshort IR that densely populate the PAR1 region

The (GAAATTTC)n repeats

The recurrence of the (GAAATTTC)n repeat within PAR1was intriguing since this sequence possesses the RY mirrorsymmetry that facilitates triplex-formation (1361718) Inaddition its similarity to the (GAATTC)n motifs whichform extraordinarily stable non-B structures (26345455)raised the question as to whether these structural propertiesmight have been functionally exploited throughout the gen-ome We addressed this issue by searching for RY tractswith mirror symmetry gt30 bp in length (sufficient to yieldstable triplexes) and comparing their frequencies with thoseof all microsatellites of comparable repeat units and totallengths The search for microsatellites with unit lengthsfrom 1 to 29 nt yielded a bimodal distribution with a firstpeak comprising the mono- to penta-nucleotide repeats(Supplementary Figure 5) and a second peak composed of

Figure 2 Relationship between gene size and the presence of RY tracts Thefractions of genes with log gene size falling within 02 (025 for gt100 bp RYtract-containing genes and 03 for gt250 bp RY tract-containing genes) log-intervals plotted as a function of their mean values Gray 22799 human genesblack 1377 brain-specific human genes green 1891 gt100 bp RY tract-containing human genes red 200gt250 bp RY tract-containing human genes

Nucleic Acids Research 2006 Vol 34 No 9 2669

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

gt15 nt unit lengths most of which could be decomposed intoshorter repeat units displaying an evenodd length asymme-try as previously noted for the short (1ndash6mer) repeats (56)Dinucleotides were the most abundant (55 196 copies)followed by tetra- (29 590 copies) mono- (16 686 copies)penta- (8699) and tri-nucleotides (6711) The distribution ofRY tracts revealed an abundance of (AT)n (16 679 copies)over (GC)n (seven copies Figure 4) At present we do notunderstand this paucity of (GC)n runs The distributionswere followed by (GAAATTTC)n (3217 copies) (GATC)n

(2390 copies) and (GGAATTCC)n (2200 copies) All othercombinations of G+A repeats contained lt400 copies There-fore given that poly(A) sequences have the unique propertyof decreasing triplex stability with increasing length (57ndash59)these analyses (Figure 4 and Supplementary Figure 5) indi-cated that the (GAAATTTC)n repeats were over-representedwith respect to the total microsatellite population and wereespecially abundant among the triplex-forming motifs withmirror symmetry However the role if any of such tractsand the underlying ability to form intra- or intermoleculartriplexes remains to be determined

Evolutionary comparisons

To determine whether long pure RY tracts (gt250 bp) areunique to human we queried the chimpanzee dog mouserat and chicken genomes Such RY tracts lengths werefound to be common for all the mammalianavian speciesexamined (Supplementary Table 6) In addition the numberof pure sequences gt50 bp varied from 8000 in chicken togt100 000 in murids attesting to their abundance A search ofthe mouse genome for genes with gt250 bp RY tractsreturned a set of 827 non-redundant entries and indicatedthat the main functional enrichment categories (Supplemen-tary Table 7) largely overlapped those of the gt100 bp RYtract-containing human gene set (Supplementary Table 3)This indicates that RY tracts have been retained within thesame gene families since humanndashmouse divergence 80 mil-lion years ago (Mya) and also that human membrane-associated genes (Supplementary Table 2) highly expressedin brain have maintained longer RY tract lengths thanother gene families

This notwithstanding little or no homology was foundbetween human and mouse orthologous genes when 5 kbfragments flanking the gt250 bp RY tracts of the 228human genes (Supplementary Table 1) were compared Simi-larly analysis of the human and mouse orthologous geneswhich contained gt250 bp RY tracts in both species (24total) revealed little conservation in terms of either thenumber or location of the tracts (Supplementary Table 8)Thus whereas RY tracts have been retained within genefamilies their lengths and locations within individual geneshave diverged substantially

Humanndashchimpanzee orthologous RY tracts

To determine more accurately the extent of sequence conser-vation the human intragenic RY tracts gt250 bp togetherwith plusmn25 kb of flanking sequence were compared withtheir orthologous sites in the chimpanzee genome Thesespecies diverged 5ndash8 Mya from their last common ancestor(LCA) and still share 98 sequence identity Orthologoussequences were identified for most of the 239 tracts(Supplementary Text) A typical dot-plot depicting a com-parison of 5 kb of the human CD99L2 gene with its chim-panzee counterpart is displayed in Figure 5 Sequence

Figure 4 Number of Triplex-forming sequences with mirror symmetry in thehuman genome Gray bars total numbers of uninterrupted RY tracts withmirror symmetry gt30 bp in length

Figure 3 Repetitive elements in the pseudoautosomal region The vertical lines represent the locations of repetitive elements in the first 7 Mbp of the human Xchromosome containing the PAR1 region RY clustered RY repeats from Supplementary Figure 4 MR mirror repeatsgt62 bp DRgt250 direct repeatsgt250 bpIRgt250 inverted repeatsgt250 bp DRgt62 direct repeatsgt62 bp butlt250 bp IRgt62 inverted repeatsgt62 bp butlt250 bp TRBgt1kb sequences containing tandemrepeat blocks gt948 pure with a total length gt1kb red annotated genes XG (in blue) gene spanning the PAR1 boundary

2670 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

identity is evident throughout the region with the exception ofthe R tract which although present in both species is some-what shorter in chimpanzee Of all the 142 RY sequencepairs analyzed most displayed the expected 98 homology

in the RY tract-flanking sequences but none showedsequence conservation within the tracts We conclude thatthe RY tracts have mutated at a much faster rate than thesurrounding sequences Subsequent analyses (SupplementaryFigure 7) provided evidence for a combination of slippedmispairing recombination-mediated duplications nucleotidesubstitutions and possibly also gene conversion in mediatingRY tract divergence

To determine whether RY tracts manifest a length bias inhominids we compared the total length of all pure tractsgt250 bp in both species with their orthologous counterparts(Supplementary Text) 80 of the 582 tracts analyzed werefound to be longer in human than in the chimpanzee(Figure 6) This bias was unlikely to be due to the lower accu-racy of assembly of the chimpanzee genome (36middot coverageversus 6ndash10middot for the human assembly) Not only did manylong chimpanzee RY tracts with sequencing gaps matchlong tracts in the human assembly but a pairwise plot ofall tracts also displayed exponential decay when arrangedby size However these sizes diverged from the curve fitfor 150582 tracts in the chimpanzee but only for 30582 tracts in the human (inset in Figure 6) These resultssuggest that long RY tracts may have been either more read-ily acquired or maintained in the human lineage than in thechimpanzee

Finally the humanndashchimpanzee orthologous searchesrevealed two instances (namely the PRKCB1 and ADAM18genes) in which the RY tract and flanking sequences were

Figure 6 Length comparison of RY tracts between human and chimpanzee The lengths of the tracts in chimpanzee are plotted against the lengths of the orthologoustracts in human The total RY tract lengths are given including interruptions Black dots human pure RY tracts gt250 bp in length versus orthologous tracts inchimpanzee (436 total) red dots chimpanzee pure RY tracts gt250 bp in length versus orthologous tracts in human (166 total) Inset the 582 unique humanchimpanzee pairs of RY tracts from the main panel were ranked by length Black line human RY tracts (average lengthfrac14 359 plusmn 114 bp) red line chimpanzee RYtracts (average length frac14 260 plusmn 127 bp) dotted lines curves fitted to the distributions of RY tracts

Figure 5 Comparative genomics of RY tracts Dot-plot of the R-tract and 5 kbof flanking sequence in the human CD99L2 gene with the orthologous genefrom chimpanzee

Nucleic Acids Research 2006 Vol 34 No 9 2671

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

of the RY tracts observed along each chromosome bycomparing their distribution with that of a uniform Poissondistribution We assume that Xi are the distances betweenadjacent RY tracts (n tracts in total) along an individualchromosome that are not necessarily independently and ident-ically distributed (iid) Note that all points are mapped into a[01] interval Denote by

YethrTHORNi frac14

Xithornr1

jfrac141

Xjsbquo i frac14 1sbquo sbquon r thorn 1sbquor n

the distance from the j-th RY tract to its r-th nearest neigh-bor There is an order statistic associated with the sequence ofpartial sums

YethrTHORN1 sbquoY

ethrTHORN2 sbquo sbquoY

ethrTHORNnrthorn1

and a minimum defined as

methrTHORN frac14 Y1 frac14 min fY

ethrTHORN1 sbquoY

ethrTHORN2 sbquo sbquoY

ethrTHORNnrthorn1g

If we assume that distances Xi are iid then according toKarlin and Macken (47)

Pr methrTHORN ltx

n1thorneth1rTHORN

1 exp eth lTHORNsbquo l frac14 xr

rsbquor gt 1

We declare significant clustering when the observed mini-mum for the limit distribution has lt001 (or lt005) probabil-ity of occurring by chance and therefore

methrTHORN lt1

n

reth ln 099THORNn

1r

Several RY tract clusters were detected (after correctingfor adjacent tracts separated by single nucleotide interrup-tions) by testing the first minimum of r-scans (where r frac141 16)

Complexity analysis

Complexity analysis as devised by Gusev et al (48) wasemployed in a search for DR IR and mirror repeats (MR)in the pseudoautosomal region (PAR1) and the downstream44 Mb region (After_PAR) in the 5 kb sequences flankingthe RY tracts in the PRKCB1 and ADAM18 genes in the30-untranslated region (30-UTR) of the DISC1 gene and inthe comparative analyses of the RY tracts in the mammalianTIAM1 gene

RESULTS

RY tracts within genes

The human genome was screened for the largest RY tractswith the expectation that these could shed light on theirpotential biological role(s) The total number of tracts equal-ing or exceeding an arbitrary length cutoff of 250 pure R(adenine A and guanine G) or Y (cytosine C and thymineT) bases was 814 Approximately 30 (241814) werelocated within annotated genes (228 some genes containedmore than one tract Supplementary Table 1) all withinintrons The longest tract (1303 bp) was present in the

CENTA1 gene on chromosome 7 which was the only pureintragenic sequence to exceed 1 kb Tracts were then ana-lyzed to determine the RY tract length if occasional inter-ruptions (mostly single nt changes) were ignored The25 kb PKD1 RY tract (49) which has been instrumentalin the characterization of the mutational properties of RYsequences (92532) was the third longest preceded onlyby a 31 kb tract in the DKFZp434G0625 gene and a tractof nearly 4 kb in the LOC348094 gene (both encoding prod-ucts of unknown function) Hence the PKD1 gene possessesthe longest gene-associated RY tract in the human genomefor genes of known function A double logarithmic plot ofthese tracts arranged in order of size exhibited a linear rela-tionship (y frac14 3499 ndash 0458x r2 frac14 099 data not shown)indicating an exponential decay distribution This distributionsuggests that selection against or in favor of any RY tractlength is not exercised to a detectable level within the exist-ing envelope

Complexity analysis was employed to identify MR sepa-rated by any distance within these RY tracts The averagefractions of non-redundant MR elements (with no interrup-tions) per RY tract were 16 4 and 2 for MR lengths of10 20 and 30 bp respectively In addition all RY tractshad at least one 10 bp-long MR element Hence based onthe work conducted on the PKD1 and other RY tractsin vivo (928303235375051) we conclude that most ifnot all RY tracts with gt250 uninterrupted base pairs havethe potential to form intramolecular andor intermolecular tri-plexes in vivo However future work will be required toexperimentally evaluate triplex formation by these RYtracts

To determine whether there was preferential associationwithin certain categories of genes we compared the set ofRY tract-containing genes with that of the human (refer-ence) dataset in proteomic databases (Supplementary Table 2)For the Gene Ontology (GO) Molecular Function eightterms exceeded a P-value of 104 and included seven termsfor channel activities and one term for glutamate receptoractivity consistent with strong enrichment for genes encod-ing transmembrane proteins in the brain (SupplementaryTable 2A) The GO Biological Process analysis revealed13 terms which exceeded P-values of 104 and includedthree terms for cell adhesion and cell communication fourfor neuronal function and three for ion transport (Supplemen-tary Table 2B) indicative of a preferential distribution withingenes involved in specialized functions at the cell membraneThe GO Cellular Component analysis (SupplementaryTable 2C) yielded four terms that exceeded P-values of104 and which were associated with localization to the cellmembrane

The fibronectin type III C-cadherin and a-catenindomains present in many cell surface receptors and celladhesion molecules represented the most highly enrichedcategories (P-values 104) in the Protein Families (Pfam)and Protein Information Resource (PIR) databases (Supple-mentary Tables 2D and 2E) whereas eight terms includingthe large neuroactive ligand interaction pathway which com-prises an array of neuronal receptors implicated in neuropep-tide and small molecule signaling pathways were found to beenriched in the more limited BioCarta and Kyoto Encyclope-dia of Genes and Genomes databases (Supplementary

2666 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

Tables 2F and 2G) Analysis of the Protein InteractionPartners database (Supplementary Table 2H) indicated3491 enrichments led by DLG4 (psd-95 P 8 middot 104) acomponent of the post-synaptic density structure involvedin receptor clustering Finally analysis of the DiseaseAssociation database indicated that the most significantenrichment (P 5 middot 103) was for candidate genes forschizophrenia susceptibility (Supplementary Table 2I) Insummary the longest RY tracts were non-randomly dis-tributed and were disproportionately associated withgenes whose products are localized to the plasma membraneand which perform cell communication and transportfunctions

Distinct gene categories

The gt250 bp RY tract-containing genes represented 1of the annotated human gene dataset We therefore soughtto determine whether lowering the RY length thresholdwould yield a larger set of genes with the same non-randomdistribution profile A search for genes with known functionscontaining pure RY tracts gt100 bp in length yielded a totalof 2886 hits and a non-redundant gene set of 1957 Com-parison of the distribution of these RY tract-containinggenes with that of the reference dataset (SupplementaryTable 3) revealed that the terms most highly enriched werecommon to those identified for the genes containing thegt250 RY tracts (Table 1) This correlation was particularlystriking for gene products involved in signal transduction

pathways at synapses which also showed strong associa-tions with susceptibility to schizophrenia (SupplementaryFigures 1 and 2 Supplementary Table 4 and SupplementaryText)

Since P-values are sensitive to sample size we next deter-mined whether the terms were more enriched in gt250 orgt100 bp RY tract-containing genes based on lsquofold-enrichmentrsquo Consistently greater enrichments were notedfor the gt250 bp RY tract-containing genes (Table 1)Hence as the RY tract length increases so does the proba-bility of their association with genes involved in specificfunctions at the plasma membrane

A second set of terms was found to be highly enriched inthe gt100 bp RY tract-containing genes but not in the gt250bp RY tract-containing genes These included terms contain-ing transferase and kinase activities (GO_MF) and proteinreceptor phosphorylation of genes implicated in developmentand morphogenesis (GO_BP) implying an enrichment ingenes involved in signal transduction pathways (Supplemen-tary Figure 3 and Supplementary Text)

Hence we conclude that the longest RY tracts in thehuman genome are distributed between two main pools ofgenes in a length-dependent fashion The first group com-prising shorter length tracts (100 bp lt RY lt 250 bp) co-localized preferentially with genes involved in signal trans-duction pathways associated with development whereas asecond group comprising longer tracts (RY gt 250 bp)co-localized with genes mostly involved in ion transportcell adhesion and neurogenesis Since several terms exceeded

Table 1 Major enriched categories for genes with gt 250 bp and gt 100 bp RY

Combined Rank Category (termmdashpathway) P-value Fold enrichmentgt250 RY gt100 RY gt250 RY gt100 RY

GO molecuar function10 Ion channel activity 195E05 592E09 44 2212 Protein binding 314E03 625E15 17 1620 Glutamate receptor activity 611E04 192E07 103 43

GO biological process5 Cell adhesion 111E04 336E12 34 227 Cell communication 219E04 524E15 16 14

15 Transmission of nerve impulse 183E04 524E08 51 25GO cellular component

2 Membrane 278E06 246E14 16 1315 Synapse 218E02 769E05 50 2819 Extracellular matrix 496E02 133E08 23 22

Pfam family2 Fibronectin type III 284E05 143E12 51 28

15 C-cadherin 776E05 314E05 170 4317 C2 117E04 395E05 54 21

PIR family3 a-catenin 355E04 120E03 604 94

11 Cadherin 181E02 404E04 95 40KEGG pathway

5 Neuroactive ligandndashreceptor interaction 706E03 252E03 29 156 Phosphatidylinositol signaling system 230E02 187E05 48 267 Choleramdashinfection 126E02 339E03 60 22

Protein interaction partners22 DLG4 798E04 206E03 451 6123 RAC1 834E03 978E04 71 24

Disease association4 Schizophrenia 523E03 233E03 75 26

Combined Rank sum of the two categories ranks from Supplementary Tables 2 and 3 categories with one entry were excluded when more than one categorywith largely redundant gene entries was present only one was chosen

Nucleic Acids Research 2006 Vol 34 No 9 2667

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

P-values of 1010 and the different databases (particularly thelarge GO_MF GO_BP GO_CC and Pfam Family) were inagreement in identifying terms that contained the samegenes the conclusions are strongly supported by the data

Tissue gene expression

To determine whether the gt250 bp RY tract-containinggenes (250-set) and the gt100 bp RY tract-containinggenes (100-set) expressed a greater proportion of transcriptsin specific tissues as compared with the reference dataset(All-set) we examined the Affymetrix U133A tissue expres-sion dataset derived from 79 tissues from the Genomic Insti-tute of the Novartis Research Foundation (GNF) For a giventissue the fraction of highly expressed genes in the 250-set(and 100-set) was then compared with the fraction of highlyexpressed genes in the reference dataset (All-set)

The fractions of highly expressed genes containing RYtracts (250-set and 100-set) were significantly greater thanthe corresponding fractions of the All-set in all brain tissuesexamined in the atrioventricular node of the fetus and theuterus (Figure 1 P-values ranged from 4 middot 102 to lt1 middot1013 for the 100-set and from 2 middot 102 to 6 middot 105 forthe 250-set) These fractions were also consistently moreenriched in genes with longer RY tracts (250-set) thanwith shorter ones (100-set) regardless of the P-values Insummary these data are compatible either with the viewthat RY tracts with lengths gt100 bp (including gt250) co-localize preferentially with genes highly expressed in thebrain or that these tracts may perform a role in mediatinggene transcription in brain tissues

Gene size distributions

Some of the genes expressed in human brain tissues such ascontactin-associated protein-like 2 (CNTNAP2) dystrophin(DMD) and ataxin-2 binding protein (A2BP1) are amongthe largest known each spanning gt15 Mb of genomicDNA We therefore posed the question as to whether the pref-erential finding of RY tracts in brain-expressed genes mightbe associated with larger gene size The size distribution forthe annotated human gene population followed a Gaussiandistribution when the logarithms of gene lengths were ana-lyzed (Figure 2 gray) According to this distribution theaverage gene length is 18 kb (Figure 2mdashmode and meancoincide for Gaussian distributions) In contrast the set of1377 genes highly expressed in the brain yielded a Gaussiandistribution peaking at 43 kb (black) Thus human genespredominantly expressed in brain tissues are on averagemore than twice as long as those from the total gene popula-tion when their log-normal distributions are compared Thisnotwithstanding the size distributions of the RY tract-containing genes were of disproportionately greater sizewith the gt100 bp RY tract-containing gene set peaking at108 kb (green) and the gt250 bp RY tract-containing geneset peaking at 192 kb (red) In addition the gt250 bp RYtract-containing genes displayed a pronounced negativeskewness and hence a non-Gaussian behavior We thereforeconclude that larger human genes also tend to host longerRY tracts

Next we investigated whether larger genes also containeda greater number of RY tracts By analyzing the gt100 bpRY tract-containing gene set for all pure RY tracts gt50

Figure 1 Tissue distribution of RY tract-containing genes The percentage increase in the fraction of RY-containing genes (100-set and 250-set) highly expressed(Z-score gt1) in a given tissue relative to the fraction of all genes highly expressed in the same tissue is displayed on the x-axis Asterisks significantly enriched tissuesas determined by binomial probability after a Bonferroni multiple comparison correction Significance P lt 107 107 lt P lt 005 - not significant

2668 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

bp in length 33 genes were found to harbor between 20 and57 RY tracts (Supplementary Table 5) The size of thesegenes ranged from 38 kb to 23 Mb with a median of800 kb and an average of 910 kb indicating that largergenes also tend to host more RY tracts as predicted

The gene size distributions of Figure 2 also raise the ques-tion as to whether the percent increases in the proportions ofhighly expressed RY tract-containing genes in the brain(Figure 1) could be accounted for by the overall greaterlength of the genes expressed in this organ Consideringthat the distributions PAll PBrain PRY gt 100 and PRY gt 250

(Figure 2 gray black green and red lines respectively) areinterdependent and log-normal (with the exception of PRY gt

250) the predicted ratio increase may be calculated from thefollowing relationshipRethPRY middot PBrainTHORNdxRethPRY middot PAllTHORNdx

sbquo

where PRY is either PRYgt100 or PRYgt250 using the means andstandard deviations that defined the respective curvesThe values obtained for the gt100 and gt250 bp RYtract-containing gene sets were 130 and 132 respectivelyfor the integrals bound by the log gene sizes of 0 and 8(which extends beyond all gene sizes) yielding a predictedpercent increase of 30 For the 250-set all percentageincreases were greater than this value of 30 for the 13brain tissues examined (Figure 1) About 23 of the 100-setpercentage increases in brain tissues were gt30 (up to75) whereas 13 were close to 30 In summary thesedata indicate that the number of RY tract-containing genes(but not the number of RY tracts per kilobase pair)expressed in the brain is greater than expected based ontheir longer lengths

The pseudoautosomal region

The total number of pure RY tracts gt250 bp in the humangenome was 814 We evaluated whether these tracts were

evenly dispersed following a Poisson-like distributionthroughout chromosomes or whether instead they were clus-tered in specific regions By testing each chromosome sepa-rately four small clusters were noted on chromosomes 5 6and 12 each involving 2ndash5 tracts (Supplementary Figure 4)In contrast two large clusters were identified on the X and Ychromosomes of 17 and 11 tracts respectively starting fromthe telomeric p-arm and extending for 62 and 26 Mbrespectively The sequence of the first 26 Mb (pseudoautoso-mal region or PAR1) is identical on both sex chromosomesand is essential for homologous recombination during malemeiosis and chromosome pairing (5253) This promptedspeculation that the RY tracts might play a role in PAR1function by virtue of their structural properties perhaps inconcert with other non-B DNA-forming sequences such asIR DR and MR

A search for pairs of DR IR and MR with an arbitrarylength cut-off of gt62 bp in the first 70 Mb of the X chromo-some revealed a significant (c2 test) over-representation ofDR of both gt62 bp and gt250 bp and IR of gt62 bp in thePAR1 region by comparison with the 44 Mb region that fol-lows (After_PAR P lt 00001) In contrast MR of gt62 bpand IR of gt250 bp were not significantly over-represented(Figure 3 and Supplementary Text) In addition the distribu-tion of MR was characterized by a preponderance of(GAAATTTC)n motifs in PAR1 (1011 in PAR1 and 815in After-PAR) We surveyed the sequence composition ofall DR and IR of gt180 bp to determine whether they wereuniquely represented in the PAR1 region or whether they cor-responded to interspersed elements distributed genome-wideNo significant homology was found to any of the 201 repeatsin PAR1 In contrast 1121 repeats in the After-PAR regionexhibited extensive homologies with other chromosomalsites indeed 911 tracts were identified as LINE1 (79)LINE1LTR (19) or LTR (19) elements Additional largesegments of repetitive DNA were also identified in PAR1Seven (gt1 kb) were closely examined and found to compriseorderly arrays of tandem repeat blocks (TRB) containingmultiple sequence motifs with few interruptions (Supplemen-tary Figure 6) In summary these data indicated that the clus-tered RY tracts gt250 bp form part of a larger family ofrepetitive elements mostly DR of unique composition andshort IR that densely populate the PAR1 region

The (GAAATTTC)n repeats

The recurrence of the (GAAATTTC)n repeat within PAR1was intriguing since this sequence possesses the RY mirrorsymmetry that facilitates triplex-formation (1361718) Inaddition its similarity to the (GAATTC)n motifs whichform extraordinarily stable non-B structures (26345455)raised the question as to whether these structural propertiesmight have been functionally exploited throughout the gen-ome We addressed this issue by searching for RY tractswith mirror symmetry gt30 bp in length (sufficient to yieldstable triplexes) and comparing their frequencies with thoseof all microsatellites of comparable repeat units and totallengths The search for microsatellites with unit lengthsfrom 1 to 29 nt yielded a bimodal distribution with a firstpeak comprising the mono- to penta-nucleotide repeats(Supplementary Figure 5) and a second peak composed of

Figure 2 Relationship between gene size and the presence of RY tracts Thefractions of genes with log gene size falling within 02 (025 for gt100 bp RYtract-containing genes and 03 for gt250 bp RY tract-containing genes) log-intervals plotted as a function of their mean values Gray 22799 human genesblack 1377 brain-specific human genes green 1891 gt100 bp RY tract-containing human genes red 200gt250 bp RY tract-containing human genes

Nucleic Acids Research 2006 Vol 34 No 9 2669

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

gt15 nt unit lengths most of which could be decomposed intoshorter repeat units displaying an evenodd length asymme-try as previously noted for the short (1ndash6mer) repeats (56)Dinucleotides were the most abundant (55 196 copies)followed by tetra- (29 590 copies) mono- (16 686 copies)penta- (8699) and tri-nucleotides (6711) The distribution ofRY tracts revealed an abundance of (AT)n (16 679 copies)over (GC)n (seven copies Figure 4) At present we do notunderstand this paucity of (GC)n runs The distributionswere followed by (GAAATTTC)n (3217 copies) (GATC)n

(2390 copies) and (GGAATTCC)n (2200 copies) All othercombinations of G+A repeats contained lt400 copies There-fore given that poly(A) sequences have the unique propertyof decreasing triplex stability with increasing length (57ndash59)these analyses (Figure 4 and Supplementary Figure 5) indi-cated that the (GAAATTTC)n repeats were over-representedwith respect to the total microsatellite population and wereespecially abundant among the triplex-forming motifs withmirror symmetry However the role if any of such tractsand the underlying ability to form intra- or intermoleculartriplexes remains to be determined

Evolutionary comparisons

To determine whether long pure RY tracts (gt250 bp) areunique to human we queried the chimpanzee dog mouserat and chicken genomes Such RY tracts lengths werefound to be common for all the mammalianavian speciesexamined (Supplementary Table 6) In addition the numberof pure sequences gt50 bp varied from 8000 in chicken togt100 000 in murids attesting to their abundance A search ofthe mouse genome for genes with gt250 bp RY tractsreturned a set of 827 non-redundant entries and indicatedthat the main functional enrichment categories (Supplemen-tary Table 7) largely overlapped those of the gt100 bp RYtract-containing human gene set (Supplementary Table 3)This indicates that RY tracts have been retained within thesame gene families since humanndashmouse divergence 80 mil-lion years ago (Mya) and also that human membrane-associated genes (Supplementary Table 2) highly expressedin brain have maintained longer RY tract lengths thanother gene families

This notwithstanding little or no homology was foundbetween human and mouse orthologous genes when 5 kbfragments flanking the gt250 bp RY tracts of the 228human genes (Supplementary Table 1) were compared Simi-larly analysis of the human and mouse orthologous geneswhich contained gt250 bp RY tracts in both species (24total) revealed little conservation in terms of either thenumber or location of the tracts (Supplementary Table 8)Thus whereas RY tracts have been retained within genefamilies their lengths and locations within individual geneshave diverged substantially

Humanndashchimpanzee orthologous RY tracts

To determine more accurately the extent of sequence conser-vation the human intragenic RY tracts gt250 bp togetherwith plusmn25 kb of flanking sequence were compared withtheir orthologous sites in the chimpanzee genome Thesespecies diverged 5ndash8 Mya from their last common ancestor(LCA) and still share 98 sequence identity Orthologoussequences were identified for most of the 239 tracts(Supplementary Text) A typical dot-plot depicting a com-parison of 5 kb of the human CD99L2 gene with its chim-panzee counterpart is displayed in Figure 5 Sequence

Figure 4 Number of Triplex-forming sequences with mirror symmetry in thehuman genome Gray bars total numbers of uninterrupted RY tracts withmirror symmetry gt30 bp in length

Figure 3 Repetitive elements in the pseudoautosomal region The vertical lines represent the locations of repetitive elements in the first 7 Mbp of the human Xchromosome containing the PAR1 region RY clustered RY repeats from Supplementary Figure 4 MR mirror repeatsgt62 bp DRgt250 direct repeatsgt250 bpIRgt250 inverted repeatsgt250 bp DRgt62 direct repeatsgt62 bp butlt250 bp IRgt62 inverted repeatsgt62 bp butlt250 bp TRBgt1kb sequences containing tandemrepeat blocks gt948 pure with a total length gt1kb red annotated genes XG (in blue) gene spanning the PAR1 boundary

2670 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

identity is evident throughout the region with the exception ofthe R tract which although present in both species is some-what shorter in chimpanzee Of all the 142 RY sequencepairs analyzed most displayed the expected 98 homology

in the RY tract-flanking sequences but none showedsequence conservation within the tracts We conclude thatthe RY tracts have mutated at a much faster rate than thesurrounding sequences Subsequent analyses (SupplementaryFigure 7) provided evidence for a combination of slippedmispairing recombination-mediated duplications nucleotidesubstitutions and possibly also gene conversion in mediatingRY tract divergence

To determine whether RY tracts manifest a length bias inhominids we compared the total length of all pure tractsgt250 bp in both species with their orthologous counterparts(Supplementary Text) 80 of the 582 tracts analyzed werefound to be longer in human than in the chimpanzee(Figure 6) This bias was unlikely to be due to the lower accu-racy of assembly of the chimpanzee genome (36middot coverageversus 6ndash10middot for the human assembly) Not only did manylong chimpanzee RY tracts with sequencing gaps matchlong tracts in the human assembly but a pairwise plot ofall tracts also displayed exponential decay when arrangedby size However these sizes diverged from the curve fitfor 150582 tracts in the chimpanzee but only for 30582 tracts in the human (inset in Figure 6) These resultssuggest that long RY tracts may have been either more read-ily acquired or maintained in the human lineage than in thechimpanzee

Finally the humanndashchimpanzee orthologous searchesrevealed two instances (namely the PRKCB1 and ADAM18genes) in which the RY tract and flanking sequences were

Figure 6 Length comparison of RY tracts between human and chimpanzee The lengths of the tracts in chimpanzee are plotted against the lengths of the orthologoustracts in human The total RY tract lengths are given including interruptions Black dots human pure RY tracts gt250 bp in length versus orthologous tracts inchimpanzee (436 total) red dots chimpanzee pure RY tracts gt250 bp in length versus orthologous tracts in human (166 total) Inset the 582 unique humanchimpanzee pairs of RY tracts from the main panel were ranked by length Black line human RY tracts (average lengthfrac14 359 plusmn 114 bp) red line chimpanzee RYtracts (average length frac14 260 plusmn 127 bp) dotted lines curves fitted to the distributions of RY tracts

Figure 5 Comparative genomics of RY tracts Dot-plot of the R-tract and 5 kbof flanking sequence in the human CD99L2 gene with the orthologous genefrom chimpanzee

Nucleic Acids Research 2006 Vol 34 No 9 2671

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

Tables 2F and 2G) Analysis of the Protein InteractionPartners database (Supplementary Table 2H) indicated3491 enrichments led by DLG4 (psd-95 P 8 middot 104) acomponent of the post-synaptic density structure involvedin receptor clustering Finally analysis of the DiseaseAssociation database indicated that the most significantenrichment (P 5 middot 103) was for candidate genes forschizophrenia susceptibility (Supplementary Table 2I) Insummary the longest RY tracts were non-randomly dis-tributed and were disproportionately associated withgenes whose products are localized to the plasma membraneand which perform cell communication and transportfunctions

Distinct gene categories

The gt250 bp RY tract-containing genes represented 1of the annotated human gene dataset We therefore soughtto determine whether lowering the RY length thresholdwould yield a larger set of genes with the same non-randomdistribution profile A search for genes with known functionscontaining pure RY tracts gt100 bp in length yielded a totalof 2886 hits and a non-redundant gene set of 1957 Com-parison of the distribution of these RY tract-containinggenes with that of the reference dataset (SupplementaryTable 3) revealed that the terms most highly enriched werecommon to those identified for the genes containing thegt250 RY tracts (Table 1) This correlation was particularlystriking for gene products involved in signal transduction

pathways at synapses which also showed strong associa-tions with susceptibility to schizophrenia (SupplementaryFigures 1 and 2 Supplementary Table 4 and SupplementaryText)

Since P-values are sensitive to sample size we next deter-mined whether the terms were more enriched in gt250 orgt100 bp RY tract-containing genes based on lsquofold-enrichmentrsquo Consistently greater enrichments were notedfor the gt250 bp RY tract-containing genes (Table 1)Hence as the RY tract length increases so does the proba-bility of their association with genes involved in specificfunctions at the plasma membrane

A second set of terms was found to be highly enriched inthe gt100 bp RY tract-containing genes but not in the gt250bp RY tract-containing genes These included terms contain-ing transferase and kinase activities (GO_MF) and proteinreceptor phosphorylation of genes implicated in developmentand morphogenesis (GO_BP) implying an enrichment ingenes involved in signal transduction pathways (Supplemen-tary Figure 3 and Supplementary Text)

Hence we conclude that the longest RY tracts in thehuman genome are distributed between two main pools ofgenes in a length-dependent fashion The first group com-prising shorter length tracts (100 bp lt RY lt 250 bp) co-localized preferentially with genes involved in signal trans-duction pathways associated with development whereas asecond group comprising longer tracts (RY gt 250 bp)co-localized with genes mostly involved in ion transportcell adhesion and neurogenesis Since several terms exceeded

Table 1 Major enriched categories for genes with gt 250 bp and gt 100 bp RY

Combined Rank Category (termmdashpathway) P-value Fold enrichmentgt250 RY gt100 RY gt250 RY gt100 RY

GO molecuar function10 Ion channel activity 195E05 592E09 44 2212 Protein binding 314E03 625E15 17 1620 Glutamate receptor activity 611E04 192E07 103 43

GO biological process5 Cell adhesion 111E04 336E12 34 227 Cell communication 219E04 524E15 16 14

15 Transmission of nerve impulse 183E04 524E08 51 25GO cellular component

2 Membrane 278E06 246E14 16 1315 Synapse 218E02 769E05 50 2819 Extracellular matrix 496E02 133E08 23 22

Pfam family2 Fibronectin type III 284E05 143E12 51 28

15 C-cadherin 776E05 314E05 170 4317 C2 117E04 395E05 54 21

PIR family3 a-catenin 355E04 120E03 604 94

11 Cadherin 181E02 404E04 95 40KEGG pathway

5 Neuroactive ligandndashreceptor interaction 706E03 252E03 29 156 Phosphatidylinositol signaling system 230E02 187E05 48 267 Choleramdashinfection 126E02 339E03 60 22

Protein interaction partners22 DLG4 798E04 206E03 451 6123 RAC1 834E03 978E04 71 24

Disease association4 Schizophrenia 523E03 233E03 75 26

Combined Rank sum of the two categories ranks from Supplementary Tables 2 and 3 categories with one entry were excluded when more than one categorywith largely redundant gene entries was present only one was chosen

Nucleic Acids Research 2006 Vol 34 No 9 2667

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

P-values of 1010 and the different databases (particularly thelarge GO_MF GO_BP GO_CC and Pfam Family) were inagreement in identifying terms that contained the samegenes the conclusions are strongly supported by the data

Tissue gene expression

To determine whether the gt250 bp RY tract-containinggenes (250-set) and the gt100 bp RY tract-containinggenes (100-set) expressed a greater proportion of transcriptsin specific tissues as compared with the reference dataset(All-set) we examined the Affymetrix U133A tissue expres-sion dataset derived from 79 tissues from the Genomic Insti-tute of the Novartis Research Foundation (GNF) For a giventissue the fraction of highly expressed genes in the 250-set(and 100-set) was then compared with the fraction of highlyexpressed genes in the reference dataset (All-set)

The fractions of highly expressed genes containing RYtracts (250-set and 100-set) were significantly greater thanthe corresponding fractions of the All-set in all brain tissuesexamined in the atrioventricular node of the fetus and theuterus (Figure 1 P-values ranged from 4 middot 102 to lt1 middot1013 for the 100-set and from 2 middot 102 to 6 middot 105 forthe 250-set) These fractions were also consistently moreenriched in genes with longer RY tracts (250-set) thanwith shorter ones (100-set) regardless of the P-values Insummary these data are compatible either with the viewthat RY tracts with lengths gt100 bp (including gt250) co-localize preferentially with genes highly expressed in thebrain or that these tracts may perform a role in mediatinggene transcription in brain tissues

Gene size distributions

Some of the genes expressed in human brain tissues such ascontactin-associated protein-like 2 (CNTNAP2) dystrophin(DMD) and ataxin-2 binding protein (A2BP1) are amongthe largest known each spanning gt15 Mb of genomicDNA We therefore posed the question as to whether the pref-erential finding of RY tracts in brain-expressed genes mightbe associated with larger gene size The size distribution forthe annotated human gene population followed a Gaussiandistribution when the logarithms of gene lengths were ana-lyzed (Figure 2 gray) According to this distribution theaverage gene length is 18 kb (Figure 2mdashmode and meancoincide for Gaussian distributions) In contrast the set of1377 genes highly expressed in the brain yielded a Gaussiandistribution peaking at 43 kb (black) Thus human genespredominantly expressed in brain tissues are on averagemore than twice as long as those from the total gene popula-tion when their log-normal distributions are compared Thisnotwithstanding the size distributions of the RY tract-containing genes were of disproportionately greater sizewith the gt100 bp RY tract-containing gene set peaking at108 kb (green) and the gt250 bp RY tract-containing geneset peaking at 192 kb (red) In addition the gt250 bp RYtract-containing genes displayed a pronounced negativeskewness and hence a non-Gaussian behavior We thereforeconclude that larger human genes also tend to host longerRY tracts

Next we investigated whether larger genes also containeda greater number of RY tracts By analyzing the gt100 bpRY tract-containing gene set for all pure RY tracts gt50

Figure 1 Tissue distribution of RY tract-containing genes The percentage increase in the fraction of RY-containing genes (100-set and 250-set) highly expressed(Z-score gt1) in a given tissue relative to the fraction of all genes highly expressed in the same tissue is displayed on the x-axis Asterisks significantly enriched tissuesas determined by binomial probability after a Bonferroni multiple comparison correction Significance P lt 107 107 lt P lt 005 - not significant

2668 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

bp in length 33 genes were found to harbor between 20 and57 RY tracts (Supplementary Table 5) The size of thesegenes ranged from 38 kb to 23 Mb with a median of800 kb and an average of 910 kb indicating that largergenes also tend to host more RY tracts as predicted

The gene size distributions of Figure 2 also raise the ques-tion as to whether the percent increases in the proportions ofhighly expressed RY tract-containing genes in the brain(Figure 1) could be accounted for by the overall greaterlength of the genes expressed in this organ Consideringthat the distributions PAll PBrain PRY gt 100 and PRY gt 250

(Figure 2 gray black green and red lines respectively) areinterdependent and log-normal (with the exception of PRY gt

250) the predicted ratio increase may be calculated from thefollowing relationshipRethPRY middot PBrainTHORNdxRethPRY middot PAllTHORNdx

sbquo

where PRY is either PRYgt100 or PRYgt250 using the means andstandard deviations that defined the respective curvesThe values obtained for the gt100 and gt250 bp RYtract-containing gene sets were 130 and 132 respectivelyfor the integrals bound by the log gene sizes of 0 and 8(which extends beyond all gene sizes) yielding a predictedpercent increase of 30 For the 250-set all percentageincreases were greater than this value of 30 for the 13brain tissues examined (Figure 1) About 23 of the 100-setpercentage increases in brain tissues were gt30 (up to75) whereas 13 were close to 30 In summary thesedata indicate that the number of RY tract-containing genes(but not the number of RY tracts per kilobase pair)expressed in the brain is greater than expected based ontheir longer lengths

The pseudoautosomal region

The total number of pure RY tracts gt250 bp in the humangenome was 814 We evaluated whether these tracts were

evenly dispersed following a Poisson-like distributionthroughout chromosomes or whether instead they were clus-tered in specific regions By testing each chromosome sepa-rately four small clusters were noted on chromosomes 5 6and 12 each involving 2ndash5 tracts (Supplementary Figure 4)In contrast two large clusters were identified on the X and Ychromosomes of 17 and 11 tracts respectively starting fromthe telomeric p-arm and extending for 62 and 26 Mbrespectively The sequence of the first 26 Mb (pseudoautoso-mal region or PAR1) is identical on both sex chromosomesand is essential for homologous recombination during malemeiosis and chromosome pairing (5253) This promptedspeculation that the RY tracts might play a role in PAR1function by virtue of their structural properties perhaps inconcert with other non-B DNA-forming sequences such asIR DR and MR

A search for pairs of DR IR and MR with an arbitrarylength cut-off of gt62 bp in the first 70 Mb of the X chromo-some revealed a significant (c2 test) over-representation ofDR of both gt62 bp and gt250 bp and IR of gt62 bp in thePAR1 region by comparison with the 44 Mb region that fol-lows (After_PAR P lt 00001) In contrast MR of gt62 bpand IR of gt250 bp were not significantly over-represented(Figure 3 and Supplementary Text) In addition the distribu-tion of MR was characterized by a preponderance of(GAAATTTC)n motifs in PAR1 (1011 in PAR1 and 815in After-PAR) We surveyed the sequence composition ofall DR and IR of gt180 bp to determine whether they wereuniquely represented in the PAR1 region or whether they cor-responded to interspersed elements distributed genome-wideNo significant homology was found to any of the 201 repeatsin PAR1 In contrast 1121 repeats in the After-PAR regionexhibited extensive homologies with other chromosomalsites indeed 911 tracts were identified as LINE1 (79)LINE1LTR (19) or LTR (19) elements Additional largesegments of repetitive DNA were also identified in PAR1Seven (gt1 kb) were closely examined and found to compriseorderly arrays of tandem repeat blocks (TRB) containingmultiple sequence motifs with few interruptions (Supplemen-tary Figure 6) In summary these data indicated that the clus-tered RY tracts gt250 bp form part of a larger family ofrepetitive elements mostly DR of unique composition andshort IR that densely populate the PAR1 region

The (GAAATTTC)n repeats

The recurrence of the (GAAATTTC)n repeat within PAR1was intriguing since this sequence possesses the RY mirrorsymmetry that facilitates triplex-formation (1361718) Inaddition its similarity to the (GAATTC)n motifs whichform extraordinarily stable non-B structures (26345455)raised the question as to whether these structural propertiesmight have been functionally exploited throughout the gen-ome We addressed this issue by searching for RY tractswith mirror symmetry gt30 bp in length (sufficient to yieldstable triplexes) and comparing their frequencies with thoseof all microsatellites of comparable repeat units and totallengths The search for microsatellites with unit lengthsfrom 1 to 29 nt yielded a bimodal distribution with a firstpeak comprising the mono- to penta-nucleotide repeats(Supplementary Figure 5) and a second peak composed of

Figure 2 Relationship between gene size and the presence of RY tracts Thefractions of genes with log gene size falling within 02 (025 for gt100 bp RYtract-containing genes and 03 for gt250 bp RY tract-containing genes) log-intervals plotted as a function of their mean values Gray 22799 human genesblack 1377 brain-specific human genes green 1891 gt100 bp RY tract-containing human genes red 200gt250 bp RY tract-containing human genes

Nucleic Acids Research 2006 Vol 34 No 9 2669

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

gt15 nt unit lengths most of which could be decomposed intoshorter repeat units displaying an evenodd length asymme-try as previously noted for the short (1ndash6mer) repeats (56)Dinucleotides were the most abundant (55 196 copies)followed by tetra- (29 590 copies) mono- (16 686 copies)penta- (8699) and tri-nucleotides (6711) The distribution ofRY tracts revealed an abundance of (AT)n (16 679 copies)over (GC)n (seven copies Figure 4) At present we do notunderstand this paucity of (GC)n runs The distributionswere followed by (GAAATTTC)n (3217 copies) (GATC)n

(2390 copies) and (GGAATTCC)n (2200 copies) All othercombinations of G+A repeats contained lt400 copies There-fore given that poly(A) sequences have the unique propertyof decreasing triplex stability with increasing length (57ndash59)these analyses (Figure 4 and Supplementary Figure 5) indi-cated that the (GAAATTTC)n repeats were over-representedwith respect to the total microsatellite population and wereespecially abundant among the triplex-forming motifs withmirror symmetry However the role if any of such tractsand the underlying ability to form intra- or intermoleculartriplexes remains to be determined

Evolutionary comparisons

To determine whether long pure RY tracts (gt250 bp) areunique to human we queried the chimpanzee dog mouserat and chicken genomes Such RY tracts lengths werefound to be common for all the mammalianavian speciesexamined (Supplementary Table 6) In addition the numberof pure sequences gt50 bp varied from 8000 in chicken togt100 000 in murids attesting to their abundance A search ofthe mouse genome for genes with gt250 bp RY tractsreturned a set of 827 non-redundant entries and indicatedthat the main functional enrichment categories (Supplemen-tary Table 7) largely overlapped those of the gt100 bp RYtract-containing human gene set (Supplementary Table 3)This indicates that RY tracts have been retained within thesame gene families since humanndashmouse divergence 80 mil-lion years ago (Mya) and also that human membrane-associated genes (Supplementary Table 2) highly expressedin brain have maintained longer RY tract lengths thanother gene families

This notwithstanding little or no homology was foundbetween human and mouse orthologous genes when 5 kbfragments flanking the gt250 bp RY tracts of the 228human genes (Supplementary Table 1) were compared Simi-larly analysis of the human and mouse orthologous geneswhich contained gt250 bp RY tracts in both species (24total) revealed little conservation in terms of either thenumber or location of the tracts (Supplementary Table 8)Thus whereas RY tracts have been retained within genefamilies their lengths and locations within individual geneshave diverged substantially

Humanndashchimpanzee orthologous RY tracts

To determine more accurately the extent of sequence conser-vation the human intragenic RY tracts gt250 bp togetherwith plusmn25 kb of flanking sequence were compared withtheir orthologous sites in the chimpanzee genome Thesespecies diverged 5ndash8 Mya from their last common ancestor(LCA) and still share 98 sequence identity Orthologoussequences were identified for most of the 239 tracts(Supplementary Text) A typical dot-plot depicting a com-parison of 5 kb of the human CD99L2 gene with its chim-panzee counterpart is displayed in Figure 5 Sequence

Figure 4 Number of Triplex-forming sequences with mirror symmetry in thehuman genome Gray bars total numbers of uninterrupted RY tracts withmirror symmetry gt30 bp in length

Figure 3 Repetitive elements in the pseudoautosomal region The vertical lines represent the locations of repetitive elements in the first 7 Mbp of the human Xchromosome containing the PAR1 region RY clustered RY repeats from Supplementary Figure 4 MR mirror repeatsgt62 bp DRgt250 direct repeatsgt250 bpIRgt250 inverted repeatsgt250 bp DRgt62 direct repeatsgt62 bp butlt250 bp IRgt62 inverted repeatsgt62 bp butlt250 bp TRBgt1kb sequences containing tandemrepeat blocks gt948 pure with a total length gt1kb red annotated genes XG (in blue) gene spanning the PAR1 boundary

2670 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

identity is evident throughout the region with the exception ofthe R tract which although present in both species is some-what shorter in chimpanzee Of all the 142 RY sequencepairs analyzed most displayed the expected 98 homology

in the RY tract-flanking sequences but none showedsequence conservation within the tracts We conclude thatthe RY tracts have mutated at a much faster rate than thesurrounding sequences Subsequent analyses (SupplementaryFigure 7) provided evidence for a combination of slippedmispairing recombination-mediated duplications nucleotidesubstitutions and possibly also gene conversion in mediatingRY tract divergence

To determine whether RY tracts manifest a length bias inhominids we compared the total length of all pure tractsgt250 bp in both species with their orthologous counterparts(Supplementary Text) 80 of the 582 tracts analyzed werefound to be longer in human than in the chimpanzee(Figure 6) This bias was unlikely to be due to the lower accu-racy of assembly of the chimpanzee genome (36middot coverageversus 6ndash10middot for the human assembly) Not only did manylong chimpanzee RY tracts with sequencing gaps matchlong tracts in the human assembly but a pairwise plot ofall tracts also displayed exponential decay when arrangedby size However these sizes diverged from the curve fitfor 150582 tracts in the chimpanzee but only for 30582 tracts in the human (inset in Figure 6) These resultssuggest that long RY tracts may have been either more read-ily acquired or maintained in the human lineage than in thechimpanzee

Finally the humanndashchimpanzee orthologous searchesrevealed two instances (namely the PRKCB1 and ADAM18genes) in which the RY tract and flanking sequences were

Figure 6 Length comparison of RY tracts between human and chimpanzee The lengths of the tracts in chimpanzee are plotted against the lengths of the orthologoustracts in human The total RY tract lengths are given including interruptions Black dots human pure RY tracts gt250 bp in length versus orthologous tracts inchimpanzee (436 total) red dots chimpanzee pure RY tracts gt250 bp in length versus orthologous tracts in human (166 total) Inset the 582 unique humanchimpanzee pairs of RY tracts from the main panel were ranked by length Black line human RY tracts (average lengthfrac14 359 plusmn 114 bp) red line chimpanzee RYtracts (average length frac14 260 plusmn 127 bp) dotted lines curves fitted to the distributions of RY tracts

Figure 5 Comparative genomics of RY tracts Dot-plot of the R-tract and 5 kbof flanking sequence in the human CD99L2 gene with the orthologous genefrom chimpanzee

Nucleic Acids Research 2006 Vol 34 No 9 2671

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

P-values of 1010 and the different databases (particularly thelarge GO_MF GO_BP GO_CC and Pfam Family) were inagreement in identifying terms that contained the samegenes the conclusions are strongly supported by the data

Tissue gene expression

To determine whether the gt250 bp RY tract-containinggenes (250-set) and the gt100 bp RY tract-containinggenes (100-set) expressed a greater proportion of transcriptsin specific tissues as compared with the reference dataset(All-set) we examined the Affymetrix U133A tissue expres-sion dataset derived from 79 tissues from the Genomic Insti-tute of the Novartis Research Foundation (GNF) For a giventissue the fraction of highly expressed genes in the 250-set(and 100-set) was then compared with the fraction of highlyexpressed genes in the reference dataset (All-set)

The fractions of highly expressed genes containing RYtracts (250-set and 100-set) were significantly greater thanthe corresponding fractions of the All-set in all brain tissuesexamined in the atrioventricular node of the fetus and theuterus (Figure 1 P-values ranged from 4 middot 102 to lt1 middot1013 for the 100-set and from 2 middot 102 to 6 middot 105 forthe 250-set) These fractions were also consistently moreenriched in genes with longer RY tracts (250-set) thanwith shorter ones (100-set) regardless of the P-values Insummary these data are compatible either with the viewthat RY tracts with lengths gt100 bp (including gt250) co-localize preferentially with genes highly expressed in thebrain or that these tracts may perform a role in mediatinggene transcription in brain tissues

Gene size distributions

Some of the genes expressed in human brain tissues such ascontactin-associated protein-like 2 (CNTNAP2) dystrophin(DMD) and ataxin-2 binding protein (A2BP1) are amongthe largest known each spanning gt15 Mb of genomicDNA We therefore posed the question as to whether the pref-erential finding of RY tracts in brain-expressed genes mightbe associated with larger gene size The size distribution forthe annotated human gene population followed a Gaussiandistribution when the logarithms of gene lengths were ana-lyzed (Figure 2 gray) According to this distribution theaverage gene length is 18 kb (Figure 2mdashmode and meancoincide for Gaussian distributions) In contrast the set of1377 genes highly expressed in the brain yielded a Gaussiandistribution peaking at 43 kb (black) Thus human genespredominantly expressed in brain tissues are on averagemore than twice as long as those from the total gene popula-tion when their log-normal distributions are compared Thisnotwithstanding the size distributions of the RY tract-containing genes were of disproportionately greater sizewith the gt100 bp RY tract-containing gene set peaking at108 kb (green) and the gt250 bp RY tract-containing geneset peaking at 192 kb (red) In addition the gt250 bp RYtract-containing genes displayed a pronounced negativeskewness and hence a non-Gaussian behavior We thereforeconclude that larger human genes also tend to host longerRY tracts

Next we investigated whether larger genes also containeda greater number of RY tracts By analyzing the gt100 bpRY tract-containing gene set for all pure RY tracts gt50

Figure 1 Tissue distribution of RY tract-containing genes The percentage increase in the fraction of RY-containing genes (100-set and 250-set) highly expressed(Z-score gt1) in a given tissue relative to the fraction of all genes highly expressed in the same tissue is displayed on the x-axis Asterisks significantly enriched tissuesas determined by binomial probability after a Bonferroni multiple comparison correction Significance P lt 107 107 lt P lt 005 - not significant

2668 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

bp in length 33 genes were found to harbor between 20 and57 RY tracts (Supplementary Table 5) The size of thesegenes ranged from 38 kb to 23 Mb with a median of800 kb and an average of 910 kb indicating that largergenes also tend to host more RY tracts as predicted

The gene size distributions of Figure 2 also raise the ques-tion as to whether the percent increases in the proportions ofhighly expressed RY tract-containing genes in the brain(Figure 1) could be accounted for by the overall greaterlength of the genes expressed in this organ Consideringthat the distributions PAll PBrain PRY gt 100 and PRY gt 250

(Figure 2 gray black green and red lines respectively) areinterdependent and log-normal (with the exception of PRY gt

250) the predicted ratio increase may be calculated from thefollowing relationshipRethPRY middot PBrainTHORNdxRethPRY middot PAllTHORNdx

sbquo

where PRY is either PRYgt100 or PRYgt250 using the means andstandard deviations that defined the respective curvesThe values obtained for the gt100 and gt250 bp RYtract-containing gene sets were 130 and 132 respectivelyfor the integrals bound by the log gene sizes of 0 and 8(which extends beyond all gene sizes) yielding a predictedpercent increase of 30 For the 250-set all percentageincreases were greater than this value of 30 for the 13brain tissues examined (Figure 1) About 23 of the 100-setpercentage increases in brain tissues were gt30 (up to75) whereas 13 were close to 30 In summary thesedata indicate that the number of RY tract-containing genes(but not the number of RY tracts per kilobase pair)expressed in the brain is greater than expected based ontheir longer lengths

The pseudoautosomal region

The total number of pure RY tracts gt250 bp in the humangenome was 814 We evaluated whether these tracts were

evenly dispersed following a Poisson-like distributionthroughout chromosomes or whether instead they were clus-tered in specific regions By testing each chromosome sepa-rately four small clusters were noted on chromosomes 5 6and 12 each involving 2ndash5 tracts (Supplementary Figure 4)In contrast two large clusters were identified on the X and Ychromosomes of 17 and 11 tracts respectively starting fromthe telomeric p-arm and extending for 62 and 26 Mbrespectively The sequence of the first 26 Mb (pseudoautoso-mal region or PAR1) is identical on both sex chromosomesand is essential for homologous recombination during malemeiosis and chromosome pairing (5253) This promptedspeculation that the RY tracts might play a role in PAR1function by virtue of their structural properties perhaps inconcert with other non-B DNA-forming sequences such asIR DR and MR

A search for pairs of DR IR and MR with an arbitrarylength cut-off of gt62 bp in the first 70 Mb of the X chromo-some revealed a significant (c2 test) over-representation ofDR of both gt62 bp and gt250 bp and IR of gt62 bp in thePAR1 region by comparison with the 44 Mb region that fol-lows (After_PAR P lt 00001) In contrast MR of gt62 bpand IR of gt250 bp were not significantly over-represented(Figure 3 and Supplementary Text) In addition the distribu-tion of MR was characterized by a preponderance of(GAAATTTC)n motifs in PAR1 (1011 in PAR1 and 815in After-PAR) We surveyed the sequence composition ofall DR and IR of gt180 bp to determine whether they wereuniquely represented in the PAR1 region or whether they cor-responded to interspersed elements distributed genome-wideNo significant homology was found to any of the 201 repeatsin PAR1 In contrast 1121 repeats in the After-PAR regionexhibited extensive homologies with other chromosomalsites indeed 911 tracts were identified as LINE1 (79)LINE1LTR (19) or LTR (19) elements Additional largesegments of repetitive DNA were also identified in PAR1Seven (gt1 kb) were closely examined and found to compriseorderly arrays of tandem repeat blocks (TRB) containingmultiple sequence motifs with few interruptions (Supplemen-tary Figure 6) In summary these data indicated that the clus-tered RY tracts gt250 bp form part of a larger family ofrepetitive elements mostly DR of unique composition andshort IR that densely populate the PAR1 region

The (GAAATTTC)n repeats

The recurrence of the (GAAATTTC)n repeat within PAR1was intriguing since this sequence possesses the RY mirrorsymmetry that facilitates triplex-formation (1361718) Inaddition its similarity to the (GAATTC)n motifs whichform extraordinarily stable non-B structures (26345455)raised the question as to whether these structural propertiesmight have been functionally exploited throughout the gen-ome We addressed this issue by searching for RY tractswith mirror symmetry gt30 bp in length (sufficient to yieldstable triplexes) and comparing their frequencies with thoseof all microsatellites of comparable repeat units and totallengths The search for microsatellites with unit lengthsfrom 1 to 29 nt yielded a bimodal distribution with a firstpeak comprising the mono- to penta-nucleotide repeats(Supplementary Figure 5) and a second peak composed of

Figure 2 Relationship between gene size and the presence of RY tracts Thefractions of genes with log gene size falling within 02 (025 for gt100 bp RYtract-containing genes and 03 for gt250 bp RY tract-containing genes) log-intervals plotted as a function of their mean values Gray 22799 human genesblack 1377 brain-specific human genes green 1891 gt100 bp RY tract-containing human genes red 200gt250 bp RY tract-containing human genes

Nucleic Acids Research 2006 Vol 34 No 9 2669

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

gt15 nt unit lengths most of which could be decomposed intoshorter repeat units displaying an evenodd length asymme-try as previously noted for the short (1ndash6mer) repeats (56)Dinucleotides were the most abundant (55 196 copies)followed by tetra- (29 590 copies) mono- (16 686 copies)penta- (8699) and tri-nucleotides (6711) The distribution ofRY tracts revealed an abundance of (AT)n (16 679 copies)over (GC)n (seven copies Figure 4) At present we do notunderstand this paucity of (GC)n runs The distributionswere followed by (GAAATTTC)n (3217 copies) (GATC)n

(2390 copies) and (GGAATTCC)n (2200 copies) All othercombinations of G+A repeats contained lt400 copies There-fore given that poly(A) sequences have the unique propertyof decreasing triplex stability with increasing length (57ndash59)these analyses (Figure 4 and Supplementary Figure 5) indi-cated that the (GAAATTTC)n repeats were over-representedwith respect to the total microsatellite population and wereespecially abundant among the triplex-forming motifs withmirror symmetry However the role if any of such tractsand the underlying ability to form intra- or intermoleculartriplexes remains to be determined

Evolutionary comparisons

To determine whether long pure RY tracts (gt250 bp) areunique to human we queried the chimpanzee dog mouserat and chicken genomes Such RY tracts lengths werefound to be common for all the mammalianavian speciesexamined (Supplementary Table 6) In addition the numberof pure sequences gt50 bp varied from 8000 in chicken togt100 000 in murids attesting to their abundance A search ofthe mouse genome for genes with gt250 bp RY tractsreturned a set of 827 non-redundant entries and indicatedthat the main functional enrichment categories (Supplemen-tary Table 7) largely overlapped those of the gt100 bp RYtract-containing human gene set (Supplementary Table 3)This indicates that RY tracts have been retained within thesame gene families since humanndashmouse divergence 80 mil-lion years ago (Mya) and also that human membrane-associated genes (Supplementary Table 2) highly expressedin brain have maintained longer RY tract lengths thanother gene families

This notwithstanding little or no homology was foundbetween human and mouse orthologous genes when 5 kbfragments flanking the gt250 bp RY tracts of the 228human genes (Supplementary Table 1) were compared Simi-larly analysis of the human and mouse orthologous geneswhich contained gt250 bp RY tracts in both species (24total) revealed little conservation in terms of either thenumber or location of the tracts (Supplementary Table 8)Thus whereas RY tracts have been retained within genefamilies their lengths and locations within individual geneshave diverged substantially

Humanndashchimpanzee orthologous RY tracts

To determine more accurately the extent of sequence conser-vation the human intragenic RY tracts gt250 bp togetherwith plusmn25 kb of flanking sequence were compared withtheir orthologous sites in the chimpanzee genome Thesespecies diverged 5ndash8 Mya from their last common ancestor(LCA) and still share 98 sequence identity Orthologoussequences were identified for most of the 239 tracts(Supplementary Text) A typical dot-plot depicting a com-parison of 5 kb of the human CD99L2 gene with its chim-panzee counterpart is displayed in Figure 5 Sequence

Figure 4 Number of Triplex-forming sequences with mirror symmetry in thehuman genome Gray bars total numbers of uninterrupted RY tracts withmirror symmetry gt30 bp in length

Figure 3 Repetitive elements in the pseudoautosomal region The vertical lines represent the locations of repetitive elements in the first 7 Mbp of the human Xchromosome containing the PAR1 region RY clustered RY repeats from Supplementary Figure 4 MR mirror repeatsgt62 bp DRgt250 direct repeatsgt250 bpIRgt250 inverted repeatsgt250 bp DRgt62 direct repeatsgt62 bp butlt250 bp IRgt62 inverted repeatsgt62 bp butlt250 bp TRBgt1kb sequences containing tandemrepeat blocks gt948 pure with a total length gt1kb red annotated genes XG (in blue) gene spanning the PAR1 boundary

2670 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

identity is evident throughout the region with the exception ofthe R tract which although present in both species is some-what shorter in chimpanzee Of all the 142 RY sequencepairs analyzed most displayed the expected 98 homology

in the RY tract-flanking sequences but none showedsequence conservation within the tracts We conclude thatthe RY tracts have mutated at a much faster rate than thesurrounding sequences Subsequent analyses (SupplementaryFigure 7) provided evidence for a combination of slippedmispairing recombination-mediated duplications nucleotidesubstitutions and possibly also gene conversion in mediatingRY tract divergence

To determine whether RY tracts manifest a length bias inhominids we compared the total length of all pure tractsgt250 bp in both species with their orthologous counterparts(Supplementary Text) 80 of the 582 tracts analyzed werefound to be longer in human than in the chimpanzee(Figure 6) This bias was unlikely to be due to the lower accu-racy of assembly of the chimpanzee genome (36middot coverageversus 6ndash10middot for the human assembly) Not only did manylong chimpanzee RY tracts with sequencing gaps matchlong tracts in the human assembly but a pairwise plot ofall tracts also displayed exponential decay when arrangedby size However these sizes diverged from the curve fitfor 150582 tracts in the chimpanzee but only for 30582 tracts in the human (inset in Figure 6) These resultssuggest that long RY tracts may have been either more read-ily acquired or maintained in the human lineage than in thechimpanzee

Finally the humanndashchimpanzee orthologous searchesrevealed two instances (namely the PRKCB1 and ADAM18genes) in which the RY tract and flanking sequences were

Figure 6 Length comparison of RY tracts between human and chimpanzee The lengths of the tracts in chimpanzee are plotted against the lengths of the orthologoustracts in human The total RY tract lengths are given including interruptions Black dots human pure RY tracts gt250 bp in length versus orthologous tracts inchimpanzee (436 total) red dots chimpanzee pure RY tracts gt250 bp in length versus orthologous tracts in human (166 total) Inset the 582 unique humanchimpanzee pairs of RY tracts from the main panel were ranked by length Black line human RY tracts (average lengthfrac14 359 plusmn 114 bp) red line chimpanzee RYtracts (average length frac14 260 plusmn 127 bp) dotted lines curves fitted to the distributions of RY tracts

Figure 5 Comparative genomics of RY tracts Dot-plot of the R-tract and 5 kbof flanking sequence in the human CD99L2 gene with the orthologous genefrom chimpanzee

Nucleic Acids Research 2006 Vol 34 No 9 2671

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

bp in length 33 genes were found to harbor between 20 and57 RY tracts (Supplementary Table 5) The size of thesegenes ranged from 38 kb to 23 Mb with a median of800 kb and an average of 910 kb indicating that largergenes also tend to host more RY tracts as predicted

The gene size distributions of Figure 2 also raise the ques-tion as to whether the percent increases in the proportions ofhighly expressed RY tract-containing genes in the brain(Figure 1) could be accounted for by the overall greaterlength of the genes expressed in this organ Consideringthat the distributions PAll PBrain PRY gt 100 and PRY gt 250

(Figure 2 gray black green and red lines respectively) areinterdependent and log-normal (with the exception of PRY gt

250) the predicted ratio increase may be calculated from thefollowing relationshipRethPRY middot PBrainTHORNdxRethPRY middot PAllTHORNdx

sbquo

where PRY is either PRYgt100 or PRYgt250 using the means andstandard deviations that defined the respective curvesThe values obtained for the gt100 and gt250 bp RYtract-containing gene sets were 130 and 132 respectivelyfor the integrals bound by the log gene sizes of 0 and 8(which extends beyond all gene sizes) yielding a predictedpercent increase of 30 For the 250-set all percentageincreases were greater than this value of 30 for the 13brain tissues examined (Figure 1) About 23 of the 100-setpercentage increases in brain tissues were gt30 (up to75) whereas 13 were close to 30 In summary thesedata indicate that the number of RY tract-containing genes(but not the number of RY tracts per kilobase pair)expressed in the brain is greater than expected based ontheir longer lengths

The pseudoautosomal region

The total number of pure RY tracts gt250 bp in the humangenome was 814 We evaluated whether these tracts were

evenly dispersed following a Poisson-like distributionthroughout chromosomes or whether instead they were clus-tered in specific regions By testing each chromosome sepa-rately four small clusters were noted on chromosomes 5 6and 12 each involving 2ndash5 tracts (Supplementary Figure 4)In contrast two large clusters were identified on the X and Ychromosomes of 17 and 11 tracts respectively starting fromthe telomeric p-arm and extending for 62 and 26 Mbrespectively The sequence of the first 26 Mb (pseudoautoso-mal region or PAR1) is identical on both sex chromosomesand is essential for homologous recombination during malemeiosis and chromosome pairing (5253) This promptedspeculation that the RY tracts might play a role in PAR1function by virtue of their structural properties perhaps inconcert with other non-B DNA-forming sequences such asIR DR and MR

A search for pairs of DR IR and MR with an arbitrarylength cut-off of gt62 bp in the first 70 Mb of the X chromo-some revealed a significant (c2 test) over-representation ofDR of both gt62 bp and gt250 bp and IR of gt62 bp in thePAR1 region by comparison with the 44 Mb region that fol-lows (After_PAR P lt 00001) In contrast MR of gt62 bpand IR of gt250 bp were not significantly over-represented(Figure 3 and Supplementary Text) In addition the distribu-tion of MR was characterized by a preponderance of(GAAATTTC)n motifs in PAR1 (1011 in PAR1 and 815in After-PAR) We surveyed the sequence composition ofall DR and IR of gt180 bp to determine whether they wereuniquely represented in the PAR1 region or whether they cor-responded to interspersed elements distributed genome-wideNo significant homology was found to any of the 201 repeatsin PAR1 In contrast 1121 repeats in the After-PAR regionexhibited extensive homologies with other chromosomalsites indeed 911 tracts were identified as LINE1 (79)LINE1LTR (19) or LTR (19) elements Additional largesegments of repetitive DNA were also identified in PAR1Seven (gt1 kb) were closely examined and found to compriseorderly arrays of tandem repeat blocks (TRB) containingmultiple sequence motifs with few interruptions (Supplemen-tary Figure 6) In summary these data indicated that the clus-tered RY tracts gt250 bp form part of a larger family ofrepetitive elements mostly DR of unique composition andshort IR that densely populate the PAR1 region

The (GAAATTTC)n repeats

The recurrence of the (GAAATTTC)n repeat within PAR1was intriguing since this sequence possesses the RY mirrorsymmetry that facilitates triplex-formation (1361718) Inaddition its similarity to the (GAATTC)n motifs whichform extraordinarily stable non-B structures (26345455)raised the question as to whether these structural propertiesmight have been functionally exploited throughout the gen-ome We addressed this issue by searching for RY tractswith mirror symmetry gt30 bp in length (sufficient to yieldstable triplexes) and comparing their frequencies with thoseof all microsatellites of comparable repeat units and totallengths The search for microsatellites with unit lengthsfrom 1 to 29 nt yielded a bimodal distribution with a firstpeak comprising the mono- to penta-nucleotide repeats(Supplementary Figure 5) and a second peak composed of

Figure 2 Relationship between gene size and the presence of RY tracts Thefractions of genes with log gene size falling within 02 (025 for gt100 bp RYtract-containing genes and 03 for gt250 bp RY tract-containing genes) log-intervals plotted as a function of their mean values Gray 22799 human genesblack 1377 brain-specific human genes green 1891 gt100 bp RY tract-containing human genes red 200gt250 bp RY tract-containing human genes

Nucleic Acids Research 2006 Vol 34 No 9 2669

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

gt15 nt unit lengths most of which could be decomposed intoshorter repeat units displaying an evenodd length asymme-try as previously noted for the short (1ndash6mer) repeats (56)Dinucleotides were the most abundant (55 196 copies)followed by tetra- (29 590 copies) mono- (16 686 copies)penta- (8699) and tri-nucleotides (6711) The distribution ofRY tracts revealed an abundance of (AT)n (16 679 copies)over (GC)n (seven copies Figure 4) At present we do notunderstand this paucity of (GC)n runs The distributionswere followed by (GAAATTTC)n (3217 copies) (GATC)n

(2390 copies) and (GGAATTCC)n (2200 copies) All othercombinations of G+A repeats contained lt400 copies There-fore given that poly(A) sequences have the unique propertyof decreasing triplex stability with increasing length (57ndash59)these analyses (Figure 4 and Supplementary Figure 5) indi-cated that the (GAAATTTC)n repeats were over-representedwith respect to the total microsatellite population and wereespecially abundant among the triplex-forming motifs withmirror symmetry However the role if any of such tractsand the underlying ability to form intra- or intermoleculartriplexes remains to be determined

Evolutionary comparisons

To determine whether long pure RY tracts (gt250 bp) areunique to human we queried the chimpanzee dog mouserat and chicken genomes Such RY tracts lengths werefound to be common for all the mammalianavian speciesexamined (Supplementary Table 6) In addition the numberof pure sequences gt50 bp varied from 8000 in chicken togt100 000 in murids attesting to their abundance A search ofthe mouse genome for genes with gt250 bp RY tractsreturned a set of 827 non-redundant entries and indicatedthat the main functional enrichment categories (Supplemen-tary Table 7) largely overlapped those of the gt100 bp RYtract-containing human gene set (Supplementary Table 3)This indicates that RY tracts have been retained within thesame gene families since humanndashmouse divergence 80 mil-lion years ago (Mya) and also that human membrane-associated genes (Supplementary Table 2) highly expressedin brain have maintained longer RY tract lengths thanother gene families

This notwithstanding little or no homology was foundbetween human and mouse orthologous genes when 5 kbfragments flanking the gt250 bp RY tracts of the 228human genes (Supplementary Table 1) were compared Simi-larly analysis of the human and mouse orthologous geneswhich contained gt250 bp RY tracts in both species (24total) revealed little conservation in terms of either thenumber or location of the tracts (Supplementary Table 8)Thus whereas RY tracts have been retained within genefamilies their lengths and locations within individual geneshave diverged substantially

Humanndashchimpanzee orthologous RY tracts

To determine more accurately the extent of sequence conser-vation the human intragenic RY tracts gt250 bp togetherwith plusmn25 kb of flanking sequence were compared withtheir orthologous sites in the chimpanzee genome Thesespecies diverged 5ndash8 Mya from their last common ancestor(LCA) and still share 98 sequence identity Orthologoussequences were identified for most of the 239 tracts(Supplementary Text) A typical dot-plot depicting a com-parison of 5 kb of the human CD99L2 gene with its chim-panzee counterpart is displayed in Figure 5 Sequence

Figure 4 Number of Triplex-forming sequences with mirror symmetry in thehuman genome Gray bars total numbers of uninterrupted RY tracts withmirror symmetry gt30 bp in length

Figure 3 Repetitive elements in the pseudoautosomal region The vertical lines represent the locations of repetitive elements in the first 7 Mbp of the human Xchromosome containing the PAR1 region RY clustered RY repeats from Supplementary Figure 4 MR mirror repeatsgt62 bp DRgt250 direct repeatsgt250 bpIRgt250 inverted repeatsgt250 bp DRgt62 direct repeatsgt62 bp butlt250 bp IRgt62 inverted repeatsgt62 bp butlt250 bp TRBgt1kb sequences containing tandemrepeat blocks gt948 pure with a total length gt1kb red annotated genes XG (in blue) gene spanning the PAR1 boundary

2670 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

identity is evident throughout the region with the exception ofthe R tract which although present in both species is some-what shorter in chimpanzee Of all the 142 RY sequencepairs analyzed most displayed the expected 98 homology

in the RY tract-flanking sequences but none showedsequence conservation within the tracts We conclude thatthe RY tracts have mutated at a much faster rate than thesurrounding sequences Subsequent analyses (SupplementaryFigure 7) provided evidence for a combination of slippedmispairing recombination-mediated duplications nucleotidesubstitutions and possibly also gene conversion in mediatingRY tract divergence

To determine whether RY tracts manifest a length bias inhominids we compared the total length of all pure tractsgt250 bp in both species with their orthologous counterparts(Supplementary Text) 80 of the 582 tracts analyzed werefound to be longer in human than in the chimpanzee(Figure 6) This bias was unlikely to be due to the lower accu-racy of assembly of the chimpanzee genome (36middot coverageversus 6ndash10middot for the human assembly) Not only did manylong chimpanzee RY tracts with sequencing gaps matchlong tracts in the human assembly but a pairwise plot ofall tracts also displayed exponential decay when arrangedby size However these sizes diverged from the curve fitfor 150582 tracts in the chimpanzee but only for 30582 tracts in the human (inset in Figure 6) These resultssuggest that long RY tracts may have been either more read-ily acquired or maintained in the human lineage than in thechimpanzee

Finally the humanndashchimpanzee orthologous searchesrevealed two instances (namely the PRKCB1 and ADAM18genes) in which the RY tract and flanking sequences were

Figure 6 Length comparison of RY tracts between human and chimpanzee The lengths of the tracts in chimpanzee are plotted against the lengths of the orthologoustracts in human The total RY tract lengths are given including interruptions Black dots human pure RY tracts gt250 bp in length versus orthologous tracts inchimpanzee (436 total) red dots chimpanzee pure RY tracts gt250 bp in length versus orthologous tracts in human (166 total) Inset the 582 unique humanchimpanzee pairs of RY tracts from the main panel were ranked by length Black line human RY tracts (average lengthfrac14 359 plusmn 114 bp) red line chimpanzee RYtracts (average length frac14 260 plusmn 127 bp) dotted lines curves fitted to the distributions of RY tracts

Figure 5 Comparative genomics of RY tracts Dot-plot of the R-tract and 5 kbof flanking sequence in the human CD99L2 gene with the orthologous genefrom chimpanzee

Nucleic Acids Research 2006 Vol 34 No 9 2671

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

gt15 nt unit lengths most of which could be decomposed intoshorter repeat units displaying an evenodd length asymme-try as previously noted for the short (1ndash6mer) repeats (56)Dinucleotides were the most abundant (55 196 copies)followed by tetra- (29 590 copies) mono- (16 686 copies)penta- (8699) and tri-nucleotides (6711) The distribution ofRY tracts revealed an abundance of (AT)n (16 679 copies)over (GC)n (seven copies Figure 4) At present we do notunderstand this paucity of (GC)n runs The distributionswere followed by (GAAATTTC)n (3217 copies) (GATC)n

(2390 copies) and (GGAATTCC)n (2200 copies) All othercombinations of G+A repeats contained lt400 copies There-fore given that poly(A) sequences have the unique propertyof decreasing triplex stability with increasing length (57ndash59)these analyses (Figure 4 and Supplementary Figure 5) indi-cated that the (GAAATTTC)n repeats were over-representedwith respect to the total microsatellite population and wereespecially abundant among the triplex-forming motifs withmirror symmetry However the role if any of such tractsand the underlying ability to form intra- or intermoleculartriplexes remains to be determined

Evolutionary comparisons

To determine whether long pure RY tracts (gt250 bp) areunique to human we queried the chimpanzee dog mouserat and chicken genomes Such RY tracts lengths werefound to be common for all the mammalianavian speciesexamined (Supplementary Table 6) In addition the numberof pure sequences gt50 bp varied from 8000 in chicken togt100 000 in murids attesting to their abundance A search ofthe mouse genome for genes with gt250 bp RY tractsreturned a set of 827 non-redundant entries and indicatedthat the main functional enrichment categories (Supplemen-tary Table 7) largely overlapped those of the gt100 bp RYtract-containing human gene set (Supplementary Table 3)This indicates that RY tracts have been retained within thesame gene families since humanndashmouse divergence 80 mil-lion years ago (Mya) and also that human membrane-associated genes (Supplementary Table 2) highly expressedin brain have maintained longer RY tract lengths thanother gene families

This notwithstanding little or no homology was foundbetween human and mouse orthologous genes when 5 kbfragments flanking the gt250 bp RY tracts of the 228human genes (Supplementary Table 1) were compared Simi-larly analysis of the human and mouse orthologous geneswhich contained gt250 bp RY tracts in both species (24total) revealed little conservation in terms of either thenumber or location of the tracts (Supplementary Table 8)Thus whereas RY tracts have been retained within genefamilies their lengths and locations within individual geneshave diverged substantially

Humanndashchimpanzee orthologous RY tracts

To determine more accurately the extent of sequence conser-vation the human intragenic RY tracts gt250 bp togetherwith plusmn25 kb of flanking sequence were compared withtheir orthologous sites in the chimpanzee genome Thesespecies diverged 5ndash8 Mya from their last common ancestor(LCA) and still share 98 sequence identity Orthologoussequences were identified for most of the 239 tracts(Supplementary Text) A typical dot-plot depicting a com-parison of 5 kb of the human CD99L2 gene with its chim-panzee counterpart is displayed in Figure 5 Sequence

Figure 4 Number of Triplex-forming sequences with mirror symmetry in thehuman genome Gray bars total numbers of uninterrupted RY tracts withmirror symmetry gt30 bp in length

Figure 3 Repetitive elements in the pseudoautosomal region The vertical lines represent the locations of repetitive elements in the first 7 Mbp of the human Xchromosome containing the PAR1 region RY clustered RY repeats from Supplementary Figure 4 MR mirror repeatsgt62 bp DRgt250 direct repeatsgt250 bpIRgt250 inverted repeatsgt250 bp DRgt62 direct repeatsgt62 bp butlt250 bp IRgt62 inverted repeatsgt62 bp butlt250 bp TRBgt1kb sequences containing tandemrepeat blocks gt948 pure with a total length gt1kb red annotated genes XG (in blue) gene spanning the PAR1 boundary

2670 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

identity is evident throughout the region with the exception ofthe R tract which although present in both species is some-what shorter in chimpanzee Of all the 142 RY sequencepairs analyzed most displayed the expected 98 homology

in the RY tract-flanking sequences but none showedsequence conservation within the tracts We conclude thatthe RY tracts have mutated at a much faster rate than thesurrounding sequences Subsequent analyses (SupplementaryFigure 7) provided evidence for a combination of slippedmispairing recombination-mediated duplications nucleotidesubstitutions and possibly also gene conversion in mediatingRY tract divergence

To determine whether RY tracts manifest a length bias inhominids we compared the total length of all pure tractsgt250 bp in both species with their orthologous counterparts(Supplementary Text) 80 of the 582 tracts analyzed werefound to be longer in human than in the chimpanzee(Figure 6) This bias was unlikely to be due to the lower accu-racy of assembly of the chimpanzee genome (36middot coverageversus 6ndash10middot for the human assembly) Not only did manylong chimpanzee RY tracts with sequencing gaps matchlong tracts in the human assembly but a pairwise plot ofall tracts also displayed exponential decay when arrangedby size However these sizes diverged from the curve fitfor 150582 tracts in the chimpanzee but only for 30582 tracts in the human (inset in Figure 6) These resultssuggest that long RY tracts may have been either more read-ily acquired or maintained in the human lineage than in thechimpanzee

Finally the humanndashchimpanzee orthologous searchesrevealed two instances (namely the PRKCB1 and ADAM18genes) in which the RY tract and flanking sequences were

Figure 6 Length comparison of RY tracts between human and chimpanzee The lengths of the tracts in chimpanzee are plotted against the lengths of the orthologoustracts in human The total RY tract lengths are given including interruptions Black dots human pure RY tracts gt250 bp in length versus orthologous tracts inchimpanzee (436 total) red dots chimpanzee pure RY tracts gt250 bp in length versus orthologous tracts in human (166 total) Inset the 582 unique humanchimpanzee pairs of RY tracts from the main panel were ranked by length Black line human RY tracts (average lengthfrac14 359 plusmn 114 bp) red line chimpanzee RYtracts (average length frac14 260 plusmn 127 bp) dotted lines curves fitted to the distributions of RY tracts

Figure 5 Comparative genomics of RY tracts Dot-plot of the R-tract and 5 kbof flanking sequence in the human CD99L2 gene with the orthologous genefrom chimpanzee

Nucleic Acids Research 2006 Vol 34 No 9 2671

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

identity is evident throughout the region with the exception ofthe R tract which although present in both species is some-what shorter in chimpanzee Of all the 142 RY sequencepairs analyzed most displayed the expected 98 homology

in the RY tract-flanking sequences but none showedsequence conservation within the tracts We conclude thatthe RY tracts have mutated at a much faster rate than thesurrounding sequences Subsequent analyses (SupplementaryFigure 7) provided evidence for a combination of slippedmispairing recombination-mediated duplications nucleotidesubstitutions and possibly also gene conversion in mediatingRY tract divergence

To determine whether RY tracts manifest a length bias inhominids we compared the total length of all pure tractsgt250 bp in both species with their orthologous counterparts(Supplementary Text) 80 of the 582 tracts analyzed werefound to be longer in human than in the chimpanzee(Figure 6) This bias was unlikely to be due to the lower accu-racy of assembly of the chimpanzee genome (36middot coverageversus 6ndash10middot for the human assembly) Not only did manylong chimpanzee RY tracts with sequencing gaps matchlong tracts in the human assembly but a pairwise plot ofall tracts also displayed exponential decay when arrangedby size However these sizes diverged from the curve fitfor 150582 tracts in the chimpanzee but only for 30582 tracts in the human (inset in Figure 6) These resultssuggest that long RY tracts may have been either more read-ily acquired or maintained in the human lineage than in thechimpanzee

Finally the humanndashchimpanzee orthologous searchesrevealed two instances (namely the PRKCB1 and ADAM18genes) in which the RY tract and flanking sequences were

Figure 6 Length comparison of RY tracts between human and chimpanzee The lengths of the tracts in chimpanzee are plotted against the lengths of the orthologoustracts in human The total RY tract lengths are given including interruptions Black dots human pure RY tracts gt250 bp in length versus orthologous tracts inchimpanzee (436 total) red dots chimpanzee pure RY tracts gt250 bp in length versus orthologous tracts in human (166 total) Inset the 582 unique humanchimpanzee pairs of RY tracts from the main panel were ranked by length Black line human RY tracts (average lengthfrac14 359 plusmn 114 bp) red line chimpanzee RYtracts (average length frac14 260 plusmn 127 bp) dotted lines curves fitted to the distributions of RY tracts

Figure 5 Comparative genomics of RY tracts Dot-plot of the R-tract and 5 kbof flanking sequence in the human CD99L2 gene with the orthologous genefrom chimpanzee

Nucleic Acids Research 2006 Vol 34 No 9 2671

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

present in only one of the two species In an attempt to under-stand the possible mutational mechanisms involved we quer-ied the macaque genome (LCA 25 Mya) and analyzed thesequence composition and breakpoint junctions The resultsof these analyses (Supplementary Figures 8 and 9 and Sup-plementary Text) were consistent with the RY tracts havingbeen inserted and deleted as part of larger (35ndash81 kb)DNA fragments through non-homologous end joining reac-tions most of which occurred at IR suggesting that cruciformstructures may have been involved in mediating thesegenomic rearrangements

DISCUSSION

This report describes the distribution of the most prominenthomopurinehomopyrimidine sequences in the human gen-ome Their preferred distribution within genes expressed inthe brain and encoding membrane-bound proteins with cha-nnel and receptor activities contrasts with that of the largestIRs (44) which are mostly associated with testis-expressedgenes on the X and Y chromosomes Whereas IRs areknown to form cruciform structures these long RY tractscontain segments that fulfill the requirements for mirror sym-metry to foster stable triplex structures although other non-BDNA conformations such as slipped structures and occa-sional tetraplexes are also possible Specific types of non-BDNA-forming sequences may therefore be associated withdistinct gene families The RY tract enrichment of thePAR1 region also differs from the distribution of the largestIR which are absent from this portion of the sex chromo-somes and are concentrated instead in downstream regions(44) This suggests that different types of non-B DNA confor-mations may be also functionally distinct Whereas cruci-forms have been proposed to mediate gene conversionthereby contributing to genomic integrity (44) RY tractsmay be involved in stimulating genomic diversity andrecombination by virtue of their ability to fold into triplexesand other conformations

Length polymorphisms have been noted in RY tract-containing alleles in mammals (Supplementary Text) Thehumanndashmouse and humanndashchimpanzee comparisons are con-sistent with the rapid mutation of long RY tracts throughlength and sequence changes driven by dynamic mutationalmechanisms that include slippage recombinationrepair andnucleotide substitution (Supplementary Figure 7) The totallength of all pairs of humanndashchimpanzee RY tracts consid-ered amounts to 213 020 bp for human and 159 020 bp forchimpanzee ie an average length divergence of 0039bases per site per My for the past 65 My This value ismore than an order of magnitude higher than the averagerate of point mutations between these two species(00019) (60) and is likely to exceed the value of 0075reported for subtelomeric segmental duplicationtransferrates during primate evolution (61) if sequence changes andnucleotide substitutions are also included

Large-scale analyses indicate a positive correlation bet-ween point mutation frequency and repeat sequence density(62) implying a role for dynamic mutation in genomicplasticity and hence genome divergence Similarly repetitivesequences increase the frequency of gross rearrange-ments (9103163) by engaging distant sites whereas

triplex-forming oligonucleotides increase nucleotode substi-tution rates (30) Genome-wide surveys have established apositive correlation between RY tracts and both recombina-tion rates and nt diversity (64ndash67) Taken together these find-ings support our contention that RY tracts perhaps by virtueof their ability to fold into unconventional DNA conforma-tions represent hotspots for the generation of DSBs whichmay then provide nucleation sites for chromatin remodelingpathways involving DNA repair and recombination (67)Studies of the relationship between genomic instabilitiesand the presence of repetitive sequences at breakpoint junc-tions suggest an active role for non-B DNA conformations(including slipped structures cruciforms and triplexes) inpromoting DSBs leading to rearrangements both in the con-text of human disease [reviewed in (8) and (163568)] andevolution (6970)

Genes involved in ion transport synaptic transmissionand brain-related functions display accelerated rates ofevolution in hominids when compared with murids (60)Concomitantly human chimpanzee and other non-humanprimates exhibit accelerated changes in gene expression inbrain tissues with increased expression specifically in thehuman lineage (71ndash73) Such acceleration has been sug-gested to be lsquocaused by positive selection that changedthe functions of genes expressed in the brains of humansmore than in the brains of chimpanzeesrsquo (71) The presenceof RY tracts within large brain-expressed genes may havesynergized with increased mutation rates thereby contribut-ing to accelerated sequence divergence these tracts mayalso have potentiated the acquisition of novel transcriptionalactivities

The number of RNA transcripts in the mammalian genomeis estimated to be at least one order of magnitude greater thanthe number of annotated genes (currently 20 000) implyingthat the majority of the mammalian genome is transcribed(74) Transcription on both strands generating sense and anti-sense transcripts with a role in RNA interference and tran-scriptional regulation is over-represented in genes encodingcytoplasmic proteins but under-represented in genes encodingmembrane and extracellular proteins (75) Expansion of a(GAATTC)n triplex-forming motif is associated with genesilencing in Friedreichrsquos ataxia as a result of strong secondarystructure formation (2) It is therefore conceivable that RYtracts particularly those with mirror symmetry such as thehighly recurrent (GAAATTTC)n repeats and other MRwithin the tracts may also contribute to senseantisensetranscriptional regulation (76)

Several RY tract-containing genes encode proteins thatfunction at convergent nodes in signaling pathways atsynapses and also represent susceptibility genes for schizo-phrenia (7778) This association and the realization thatnon-B DNA conformations induce genetic instabilities mayunderscore the delicate balance that exists between the bene-fits of accelerated evolution and the risks associated withacquiring deleterious mutations

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

2672 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

ACKNOWLEDGEMENTS

This research was supported by grants from the NationalInstitutes of Health (NS37554 and ES11347) the Robert AWelch Foundation the Friedreichrsquos Ataxia Research Alliancethe Muscular Dystrophy Foundation (Seek-a-MiracleFoundation) to RDW and in part by the IntramuralResearch Program of the NIH NCI and Federal funds fromthe NCI NIH (contract number N01-CO-12400) DNCacknowledges the financial support of BIOBASE GmbHCertain commercial equipment instruments materials orcompanies are identified in this paper to specify adequatelythe experimental procedure Such identification does not implyrecommendation or endorsement by the National Institute ofStandards and Technology nor does it imply that the materialsor equipment identified are the best available for the purposeThe content of this publication does not necessarily reflect theviews or policies of the Department of Health and HumanServices nor does mention of trade names commercial pro-ducts or organizations imply endorsement by the USGovernment Funding to pay the Open Access publicationcharges for this article was provided by the NIH and theRobert A Welch Foundation

Conflict of interest statement None declared

REFERENCES

1 SindenRR (1994) DNA Structure and Function Academic PressSan Diego CA

2 WellsRD DereR HebertML NapieralaM and SonLS (2005)Advances in mechanisms of genetic instability related to hereditaryneurological diseases Nucleic Acids Res 33 3785ndash3798

3 MirkinSM and Frank-KamenetskiiMD (1994) H-DNA and relatedstructures Annu Rev Biophys Biomol Struct 23 541ndash576

4 RichA and ZhangS (2003) Timeline Z-DNA the long road tobiological function Nature Rev Genet 4 566ndash572

5 NeidleS and ParkinsonGN (2003) The structure of telomeric DNACurr Opin Struct Biol 13 275ndash283

6 SoyferVN and PotamanVN (1996) Triple-Helical Nucleic AcidsSpringer-Verlag New York

7 HurleyLH (2002) DNA and its associated processes as targets forcancer therapy Nature Rev Cancer 2 188ndash200

8 BacollaA and WellsRD (2004) Non-B DNA conformationsgenomic rearrangements and human disease J Biol Chem 27947411ndash47414

9 BacollaA JaworskiA LarsonJE JakupciakJP ChuzhanovaNAbeysingheSS OrsquoConnellCD CooperDN and WellsRD (2004)Breakpoints of gross deletions coincide with non-B DNAconformations Proc Natl Acad Sci USA 101 14162ndash14167

10 WojciechowskaM BacollaA LarsonJE and WellsRD (2005) Themyotonic dystrophy type 1 triplet repeat sequence induces grossdeletions and inversions J Biol Chem 280 941ndash952

11 Kuroda-KawaguchiT SkaletskyH BrownLG MinxPJCordumHS WaterstonRH WilsonRK SilberS OatesRRozenS et al (2001) The AZFc region of the Y chromosome featuresmassive palindromes and uniform recurrent deletions in infertile menNature Genet 29 279ndash286

12 ReppingS SkaletskyH LangeJ SilberS Van Der VeenFOatesRD PageDC and RozenS (2002) Recombination betweenpalindromes P5 and P1 on the human Y chromosome causes massivedeletions and spermatogenic failure Am J Hum Genet 71 906ndash922

13 GotterAL ShaikhTH BudarfML RhodesCH and EmanuelBS(2004) A palindrome-mediated mechanism distinguishes translocationsinvolving LCR-B of chromosome 22q112 Hum Mol Genet 13103ndash115

14 AbeysingheSS ChuzhanovaN KrawczakM BallEV andCooperDN (2003) Translocation and gross deletion breakpointsin human inherited disease and cancer I Nucleotide composition

and recombination-associated motifs Hum Mutat22 229ndash244

15 ChuzhanovaN AbeysingheSS KrawczakM and CooperDN(2003) Translocation and gross deletion breakpoints in human inheriteddisease and cancer II Potential involvement of repetitive sequenceelements in secondary structure formation between DNA endsHum Mutat 22 245ndash251

16 KatoT InagakiH YamadaK KogoH OhyeT KowaHNagaokaK TaniguchiM EmanuelBS and KurahashiH (2006)Genetic variation affects de novo translocation frequency Science311 971

17 WellsRD CollierDA HanveyJC ShimizuM and WohlrabF(1988) The chemistry and biology of unusual DNA structuresadopted by oligopurineoligopyrimidine sequences FASEB J 22939ndash2949

18 Frank-KamenetskiiMD and MirkinSM (1995) Triplex DNAstructures Annu Rev Biochem 64 65ndash95

19 ChanPP and GlazerPM (1997) Triplex DNA fundamentalsadvances and potential applications for gene therapy J Mol Med75 267ndash282

20 InmanRB (1964) Transitions of DNA Homopolymers J Mol Biol9 624ndash637

21 WellsRD and LarsonJE (1972) Buoyant density studies on naturaland synthetic deoxyribonucleic acids in neutral and alkaline solutionsJ Biol Chem 247 3405ndash3409

22 JaishreeTN and WangAH (1993) NMR studies of pH-dependentconformational polymorphism of alternating (C-T)n sequencesNucleic Acids Res 21 3839ndash3844

23 MorganAR and WellsRD (1968) Specificity of the three-strandedcomplex formation between double-stranded DNA and single-strandedRNA containing repeating nucleotide sequences J Mol Biol37 63ndash80

24 KrasilnikovAS PanyutinIG SamadashwilyGM CoxRLazurkinYS and MirkinSM (1997) Mechanisms of triplex-causedpolymerization arrest Nucleic Acids Res 25 1339ndash1346

25 PatelHP LuL BlaszakRT and BisslerJJ (2004) PKD1 intron 21triplex DNA formation and effect on replication Nucleic Acids Res32 1460ndash1468

26 OhshimaK MonterminiL WellsRD and PandolfoM (1998)Inhibitory effects of expanded GAATTC triplet repeats from intron Iof the Friedreich ataxia gene on transcription and replication in vivoJ Biol Chem 275 14588ndash14595

27 KohwiY and PanchenkoY (1993) Transcription-dependentrecombination induced by triple-helix formation Genes Dev 71766ndash1778

28 FaruqiAF DattaHJ CarrollD SeidmanMM and GlazerPM(2000) Triple-helix formation induces recombination in mammaliancells via a nucleotide excision repair-dependent pathway Mol CellBiol 20 990ndash1000

29 LuoZ MacrisMA FaruqiAF and GlazerPM (2000)High-frequency intrachromosomal gene conversion induced bytriplex-forming oligonucleotides microinjected into mouse cellsProc Natl Acad Sci USA 97 9003ndash9008

30 VasquezKM NarayananL and GlazerPM (2000) Specificmutations induced by triplex-forming oligonucleotides in miceScience 290 530ndash533

31 WangG and VasquezKM (2004) Naturally occurringH-DNA-forming sequences are mutagenic in mammalian cellsProc Natl Acad Sci USA 101 13448ndash13453

32 BacollaA JaworskiA ConnorsTD and WellsRD (2001) PKD1unusual DNA conformations are recognized by nucleotide excisionrepair J Biol Chem 276 18597ndash18604

33 PandolfoM and KoenigM (1998) Freidreichrsquos Ataxia In WellsRDand WarrenST (eds) Genetic Instabilities and HereditaryNeurological Diseases Academic Press San Diego CA pp 373ndash398

34 VetcherAA NapieralaM IyerRR ChastainPD GriffithJD andWellsRD (2002) Sticky DNA a long GAAGAATTC triplex that isformed intramolecularly in the sequence of intron 1 of the frataxingene J Biol Chem 277 39217ndash39227

35 RaghavanSC ChastainP LeeJS HegdeBG HoustonSLangenR HsiehCL HaworthIS and LieberMR (2005) Evidencefor a triplex DNA conformation at the bcl-2 major breakpointregion of the t(1418) translocation J Biol Chem280 22749ndash22760

Nucleic Acids Research 2006 Vol 34 No 9 2673

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

36 RaghavanSC and LieberMR (2004) Chromosomal translocationsand non-B DNA structures in the human genome Cell Cycle 3762ndash768

37 RaghavanSC SwansonPC WuX HsiehCL and LieberMR(2004) A non-B-DNA structure at the Bcl-2 major breakpoint region iscleaved by the RAG complex Nature 428 88ndash93

38 KnauertMP LloydJA RogersFA DattaHJ BennettMLWeeksDL and GlazerPM (2005) Distance and affinitydependence of triplex-induced recombination Biochemistry44 3856ndash3864

39 Hoffman-LiebermannB LiebermannD TrouttA KedesLH andCohenSN (1986) Human homologs of TU transposon sequencespolypurinepolypyrimidine sequence elements that can alter DNAconformation in vitro and in vivo Mol Cell Biol 6 3632ndash3642

40 ChristopheD CabrerB BacollaA TargovnikH PohlV andVassartG (1985) An unusually long poly(purine)-poly(pyrimidine)sequence is located upstream from the human thyroglobulin geneNucleic Acids Res 13 5127ndash5144

41 BeheMJ (1987) The DNA sequence of the human beta-globin regionis strongly biased in favor of long strings of contiguous purine orpyrimidine residues Biochemistry 26 7870ndash7875

42 SchrothGP and HoPS (1995) Occurrence of potential cruciform andH-DNA forming sequences in genomic DNA Nucleic Acids Res23 1977ndash1983

43 UsseryD SoumpasisDM BrunakS StaerfeldtHH WorningPand KroghA (2002) Bias of purine stretches in sequencedchromosomes Comput Chem 26 531ndash541

44 WarburtonPE GiordanoJ CheungF GelfandY and BensonG(2004) Inverted repeat structure of the human genome the X-chromosome contains a preponderance of large highly homologousinverted repeats that contain testes genes Genome Res 141861ndash1869

45 CollinsJR StephensRM GoldB LongB DeanM and BurtSK(2003) An exhaustive DNA micro-satellite map of the human genomeusing high performance computing Genomics 82 10ndash19

46 MingY HortonD CohenJC HobbsHH and StephensRM(2006) WholePathwayScope a comprehensive pathway-based analysistool for high-throughput data BMC Bioinformatics 7 30

47 KarlinS and MackenC (1991) Some statistical problems in theassessment of inhomogenesis of DNA sequence data J Am StatistAssoc 86 27ndash35

48 GusevVD NemytikovaLA and ChuzhanovaNA (1999) On thecomplexity measures of genetic sequences Bioinformatics 15994ndash999

49 Van RaayTJ BurnTC ConnorsTD PetriLR GerminoGGKlingerKW and LandesGM (1996) A 25 kb polypyrimidine tract inthe PKD1 gene contains at least 23 H-DNA-forming sequencesMicrob Comp Genomics 1 317ndash327

50 VasquezKM WangG HavrePA and GlazerPM (1999)Chromosomal mutations induced by triplex-forming oligonicleotides inmammalian cells Nucleic Acids Res 27 1176ndash1181

51 WangG SeidmanMM and GlazerPM (1996) Mutagenesis inmammalian cells induced by triple helix formation andtranscription-coupled repair Science 271 802ndash805

52 FilatovDA and GerrardDT (2003) High mutation rates in humanand ape pseudoautosomal genes Gene 317 67ndash77

53 PerryJ PalmerS GabrielA and AshworthA (2001) A shortpseudoautosomal region in laboratory mice Genome Res 111826ndash1832

54 NapieralaM DereR VetcherA and WellsRD (2004) Structure-dependent recombination hot spot activity of GAATTC sequencesfrom intron 1 of the Friedreichrsquos ataxia gene J Biol Chem 2796444ndash6454

55 SakamotoN ChastainPD ParniewskiP OhshimaK PandolfoMGriffithJD and WellsRD (1999) Sticky DNA self-associationproperties of long GAAaTTC repeats in R R Y triplex structures fromFriedreichrsquos ataxia Molecular Cell 3 465ndash475

56 SubramanianS MishraRK and SinghL (2003) Genome-wideanalysis of microsatellite repeats in humans their abundance anddensity in specific genomic regions Genome Biol 4 R13

57 SandstromK WarmlanderS GraslundA and LeijonM (2002)A-tract DNA disfavours triplex formation J Mol Biol 315737ndash748

58 RobertsRW and CrothersDM (1996) Prediction of thestability of DNA triplexes Proc Natl Acad Sci USA93 4320ndash4325

59 JamesPL BrownT and FoxKR (2003) Thermodynamic and kineticstability of intermolecular triple helices containing differentproportions of C+GC and TAT triplets Nucleic Acids Res31 5598ndash5606

60 MikkelsenTS HillierLW EichlerEE ZodyMC JaffeDBYangS-P EnardW HellmannI Lindblad-TohKAltheideTK et al (2005) Initial sequence of the chimpanzeegenome and comparison with the human genome Nature437 69ndash87

61 LinardopoulouEV WilliamsEM FanY FriedmanC YoungJMand TraskBJ (2005) Human subtelomeres are hot spots ofinterchromosomal recombination and segmental duplication Nature437 94ndash100

62 ChiaromonteF YangS ElnitskiL YapVB MillerW andHardisonRC (2001) Association between divergence and interspersedrepeats in mammalian noncoding genomic DNA Proc Natl Acad SciUSA 98 14503ndash14508

63 MeservyJL SargentRG IyerRR ChanF McKenzieGJWellsRD and WilsonJH (2003) Long CTG tracts from the myotonicdystrophy gene induce deletions and rearrangements duringrecombination at the APRT locus in CHO cells Mol Cell Biol23 3152ndash3162

64 KongA GudbjartssonDF SainzJ JonsdottirGMGudjonssonSA RichardssonB SigurdardottirS BarnardJHallbeckB MassonG et al (2002) A high-resolutionrecombination map of the human genome Nature Genet31 241ndash247

65 Jensen-SeamanMI FureyTS PayseurBA LuY RoskinKMChenCF ThomasMA HausslerD and JacobHJ (2004)Comparative recombination rates in the rat mouse and humangenomes Genome Res 14 528ndash538

66 HellmannI PruferK JiH ZodyMC PaaboS and PtakSE(2005) Why do human diversity levels vary at a megabase scaleGenome Res 15 1222ndash1231

67 MyersS BottoloL FreemanC McVeanG and DonnellyP (2005)A fine-scale map of recombination rates and hotspots across the humangenome Science 310 321ndash324

68 WellsRD and WarrenST (1998) Genetic Instabilities andHereditary Neurological Diseases Academic Press San Diego CA

69 Kehrer-SawatzkiH SandigC ChuzhanovaN GoidtsVSzamalekJM TanzerS MullerS PlatzerM CooperDN andHameisterH (2005) Breakpoint analysis of the pericentric inversiondistinguishing human chromosome 4 from the homologouschromosome in the chimpanzee (Pan troglodytes) Hum Mutat25 45ndash55

70 SzamalekJM GoidtsV ChuzhanovaN HameisterH CooperDNand Kehrer-SawatzkiH (2005) Molecular characterisation of thepericentric inversion that distinguishes human chromosome 5 from thehomologous chimpanzee chromosome Hum Genet 117168ndash176

71 KhaitovichP HellmannI EnardW NowickK LeinweberMFranzH WeissG LachmannM and PaaboS (2005) Parallelpatterns of evolution in the genomes and transcriptomes of humans andchimpanzees Science 309 1850ndash1854

72 PreussTM CaceresM OldhamMC and GeschwindDH (2004)Human brain evolution insights from microarrays Nature Rev Genet5 850ndash860

73 UddinM WildmanDE LiuG XuW JohnsonRM HofPRKapatosG GrossmanLI and GoodmanM (2004) Sister grouping ofchimpanzees and humans as revealed by genome-wide phylogeneticanalysis of brain gene expression profiles Proc Natl Acad Sci USA101 2957ndash2962

74 CarninciP KasukawaT KatayamaS GoughJ FrithMCMaedaN OyamaR RavasiT LenhardB WellsC et al (2005) Thetranscriptional landscape of the mammalian genome Science 3091559ndash1563

75 KatayamaS TomaruY KasukawaT WakiK NakanishiMNakamuraM NishidaH YapCC SuzukiM KawaiJ et al (2005)Antisense transcription in the mammalian transcriptome Science309 1564ndash1566

2674 Nucleic Acids Research 2006 Vol 34 No 9

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from

76 FabregatI KochKS AokiT AtkinsonAE DangH AmosovaOFrescoJR SchildkrautCL and LeffertHL (2001) Functionalpleiotropy of an intramolecular triplex-forming fragmentfrom the 30-UTR of the rat Pigr gene Physiol Genomics5 53ndash65

77 HarrisonPJ and WeinbergerDR (2005) Schizophrenia genes geneexpression and neuropathology on the matter of their convergenceMol Psychiatry 10 40ndash68

78 LewisDA HashimotoT and VolkDW (2005) Cortical inhibitoryneurons and schizophrenia Nature Rev Neurosci 6 312ndash324

Nucleic Acids Research 2006 Vol 34 No 9 2675

by guest on October 12 2015

httpnaroxfordjournalsorgD

ownloaded from