Qiang Wu, Theresa Zhang, Jan-Fang Cheng, et al. Comparative DNA Sequence Analysis of Mouse and Human Protocadherin Gene Clusters. Genome Res. 2001 11: 389-404. Access the most recent version at doi: 10.1101/gr.167301

References: This article cites 39 articles, 18 of which can be accessed free at: http://genome.cshlp.org/content/11/3/389.full.html#ref-list-1

License: This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 3.0 Unported License), as described at http://creativecommons.org/licenses/by-nc/3.0/.

Published by Cold Spring Harbor Laboratory Press. Downloaded from genome.cshlp.org on January 15, 2014.


Comparative DNA Sequence Analysis of Mouse and Human Protocadherin Gene Clusters

Qiang Wu,1 Theresa Zhang,2 Jan-Fang Cheng,3 Youngwook Kim,1 Jane Grimwood,4 Jeremy Schmutz,4 Mark Dickson,4 James P. Noonan,4 Michael Q. Zhang,2 Richard M. Myers,4 and Tom Maniatis1,5

1Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA; 2Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA; 3Genome Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA; 4Department of Genetics and The Stanford Human Genome Center, Stanford University School of Medicine, Stanford, California 94305, USA

The genomic organization of the human protocadherin α, β, and γ gene clusters (designated Pcdhα [gene symbol PCDHA], Pcdhβ [PCDHB], and Pcdhγ [PCDHG]) is remarkably similar to that of immunoglobulin and T-cell receptor genes. The extracellular and transmembrane domains of each protocadherin protein are encoded by an unusually large “variable” region exon, while the intracellular domains are encoded by three small “constant” region exons located downstream from a tandem array of variable region exons. Here we report the results of a comparative DNA sequence analysis of the orthologous human (750 kb) and mouse (900 kb) protocadherin gene clusters. The organization of Pcdhα and Pcdhγ gene clusters in the two species is virtually identical, whereas the mouse Pcdhβ gene cluster is larger and contains more genes than the human Pcdhβ gene cluster. We identified conserved DNA sequences upstream of the variable region exons, and found that these sequences are more conserved between orthologs than between paralogs. Within this region, there is a highly conserved DNA sequence motif located at about the same position upstream of the translation start codon of each variable region exon. In addition, the variable region of each gene cluster contains a rich array of CpG islands, whose location corresponds to the position of each variable region exon. These observations are consistent with the proposal that the expression of each variable region exon is regulated by a distinct promoter, which is highly conserved between orthologous variable region exons in mouse and human.

[The sequence data described in this paper have been submitted to the GenBank/EMBL/DDBJ data library under accession nos. AY013756–AY013813, AY013873–AY013878, AF332005, and AF332006.]

Cadherin superfamily proteins are calcium-dependent cell-adhesion molecules that have been implicated in tissue morphogenesis during embryonic development and in the maintenance of selective neuronal connections in the adult brain (Dreyer and Roman-Dreyer 1999; Shapiro and Colman 1999; Steinberg and McNutt 1999; Bruses 2000; Gumbiner 2000; Yagi and Takeichi 2000). Classic cadherins and protocadherins are two subfamilies within the cadherin superfamily (Suzuki 1996; Nollet et al. 2000; Wu and Maniatis 2000). Classic cadherins have five ectodomain repeats, a transmembrane segment, and a conserved cytoplasmic domain that interacts with β-catenin. In contrast, protocadherins have six or more ectodomain repeats, which are encoded by unusually large exons, and have other sequence features that distinguish them from the classic cadherins, including distinct intracellular domains (Suzuki 1996; Wu and Maniatis 2000).

A new set of mouse protocadherin cDNA clones, designated CNR, was previously isolated in a yeast two-hybrid screen that used the Fyn tyrosine kinase as bait. The CNR proteins are expressed at synaptic junctions in different regions of the adult brain, and individual neurons appear to express a distinct subset of CNR mRNAs (Kohmura et al. 1998). A remarkable feature of these protocadherin cDNAs is that the sequence of the 5′ region of each cDNA, which encodes the extracellular and transmembrane domains, differs from each other, whereas the 3′ region of each cDNA, which encodes the intracellular Fyn-interaction domain, is identical.

To investigate the mechanism of cell-specific protocadherin gene expression, we determined the genomic organization of the human protocadherin genes (Wu and Maniatis 1999; also see human genes in Fig. 1). Three closely linked human protocadherin gene clusters, designated Pcdhα (which are the orthologs of the mouse CNR genes), Pcdhβ, and Pcdhγ, were identified in the 5q31 region of human chromosome 5. Remarkably, the variable 5′ region of each human protocadherin cDNA was found to be encoded by a different large exon, and the variable region exons are organized in a tandem array in each gene cluster. Sequence comparisons of genomic DNA and cDNAs identified three small exons downstream from Pcdhα and Pcdhγ variable region exons that encode the common 3′ region of protocadherin cDNAs and were therefore designated Pcdhα and Pcdhγ constant region exons. Surprisingly, the Pcdhβ gene cluster does not contain constant region sequences. Thus, all Pcdhβ genes consist of a single exon that encodes the extracellular, transmembrane and short cytoplasmic domains of the protein. Further studies revealed that each of the variable region exons of Pcdhα and Pcdhγ gene clusters is independently spliced to the respective three constant region exons. Therefore, all of the protocadherin proteins encoded in the Pcdhα and Pcdhγ gene clusters have similar but non-identical N-terminal extracellular and transmembrane domains, whereas the identical C-terminal cytoplasmic domains within each cluster are encoded by the constant region exons unique to each cluster. This variable and constant region organization of Pcdhα and Pcdhγ proteins suggests that diverse extracellular signals could converge on a single cytoplasmic signal transduction pathway. We noted that the organization of the Pcdhα and Pcdhγ gene clusters is strikingly similar to that of both the immunoglobulin (Ig) and T-cell receptor (TCR) gene clusters (Wu and Maniatis 1999). Comparison of genomic and cDNA sequences of Pcdhα and Pcdhγ genes suggests that the patterns of cell-specific expression of individual protocadherin proteins are established by a novel mechanism. Subsequently, an almost identical organization was reported for the mouse CNR (mouse Pcdhα) gene cluster (Sugino et al. 2000).

5Corresponding author. E-MAIL [email protected]; FAX (617) 495 3537. Article and publication are at www.genome.org/cgi/doi/10.1101/gr.167301.

Letter. Genome Research 11:389–404 ©2001 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/01 $5.00; www.genome.org

A puzzling feature of the human Pcdhα and Pcdhγ gene clusters is the presence of variable region exons near the end of the two gene clusters that are more similar to each other than to the other variable region sequences within each cluster. These exons were designated Pcdhα-C1 and -C2 in the Pcdhα gene cluster, and Pcdhγ-C3, -C4, and -C5 in the Pcdhγ gene cluster. In contrast, the Pcdhβ gene cluster does not have a C-type protocadherin variable region sequence. All members of the Pcdhβ gene cluster are very similar to each other and have features distinct from members of the Pcdhα and Pcdhγ gene clusters (Wu and Maniatis 1999).

Protocadherin genes are expressed in specific regions of the brain (Kohmura et al. 1998; Hirano et al. 1999a,b; Yamagata et al. 1999; Redies 2000), and they have been proposed to be a part of the molecular code for establishing and maintaining specific neuronal connections in the brain (Hagler and Goda 1998; Dreyer and Roman-Dreyer 1999; Serafini 1999; Shapiro and Colman 1999; Wu and Maniatis 1999). An understanding of the mechanism of cell-specific protocadherin gene expression may therefore provide insights into the specificity of neuronal cell–cell connections during development and in response to cognitive and sensory inputs. On the basis of the unusual genomic organization of protocadherin gene clusters, we proposed four models for the cell-specific expression of protocadherins, which included a cell-specific DNA rearrangement, and cis- or trans-alternative splicing mechanisms (Wu and Maniatis 1999).

Here we report the complete DNA sequence of the mouse protocadherin gene clusters on chromosome 18 and present a comparative analysis of the mouse and human protocadherin gene clusters. This sequence comparison provides insights into the mechanism of protocadherin gene expression, and the mouse sequence will provide information necessary for studies in the more experimentally tractable mouse model. We have identified ∼60 mouse protocadherin genes in this region, and find that the overall organization of the mouse and human Pcdhα and Pcdhγ gene clusters is essentially identical (Fig. 1). However, the mouse Pcdhβ gene cluster has six more genes than the corresponding human Pcdhβ cluster. Comparative analysis of intergenic regions revealed sequences upstream of each variable region exon that are highly conserved between human and mouse, but less conserved between genes within each gene cluster in either human or mouse. In addition, the pattern of CpG island distribution corresponds with that of variable region exons. These observations suggest that each variable region exon is transcribed from its own promoter.

RESULTS

Genomic Organization of the Mouse Protocadherin Gene Clusters on the 18c Region of Chromosome 18

Based on the organization of human protocadherin gene clusters in the 5q31 region of chromosome 5, and available mouse cDNA and EST sequence information from GenBank, we designed 19 pairs of PCR primers to amplify genomic DNA containing the homologous mouse protocadherin genes. We used these primers to screen a mouse BAC genomic DNA library (RPCI-23), and isolated 21 BAC clones containing sequences of the mouse protocadherin gene clusters. From the restriction maps of these BAC clones, seven minimally overlapping clones were selected for DNA sequencing (RPCI-23_193o23, 6p18, 72c14, 92d17, 161o8, 56b11, and 19k11) (Fig. 1A). The total extent of genomic DNA included in the seven BACs (excluding the overlapping regions) was estimated by pulsed-field gel electrophoresis to be ∼1 Mb. All seven clones were mapped by fluorescence in situ hybridization (FISH) to the 18c region of mouse chromosome 18, which is homologous to the 5q31 region of human chromosome 5.

Analysis of the mouse genomic DNA sequences revealed 14 Pcdhα genes that are highly similar to the human Pcdhα genes. The variable region exons of these mouse Pcdhα genes are organized in a tandem array spanning a region of 250 kb of mouse genomic DNA. Like the human protocadherin gene clusters, the constant region of the mouse Pcdhα gene cluster is organized into three small exons located downstream from the variable region tandem array (Fig. 1B). Following the Pcdhα gene cluster there is a second cluster of mouse Pcdhβ genes, which is followed in turn by a third cluster of Pcdhγ genes. Like the human Pcdhβ gene cluster, no constant region exons were found for the mouse Pcdhβ gene cluster (Fig. 1C). However, three small constant region exons are located downstream of the mouse Pcdhγ variable region exons (Fig. 1D). Thus, the overall genomic organization of the three protocadherin gene clusters is highly conserved between mouse and human. In total, we identified ∼60 protocadherin genes in this region. The upstream and downstream limits of the gene clusters were defined by the presence of a histidyl-tRNA synthetase homologous gene (O’Hanlon et al. 1995) upstream of the variable region exon of Pcdhα1, and a nonsyndromic deafness (diaphanous) gene (Lynch et al. 1997) downstream from the Pcdhγ constant region exon 3. These noncadherin genes are also conserved between human and mouse.

Comparison of the Organization of the Mouse and Human Pcdhα Gene Clusters

Sequence analysis of the genomic DNA containing the mouse Pcdhα genes revealed 14 large variable region exons encoding the protocadherin extracellular and transmembrane domains highly similar to those of the human Pcdhα proteins (Fig. 1B). Sequencing of the cDNA fragments of all mouse Pcdhα genes confirmed the consensus splice sites at the ends of all 14 variable region exons (Fig. 2A). The first 12 mouse Pcdhα genes are highly similar to each other, and eight of them are identical to the previously cloned mouse protocadherin genes (Kohmura et al. 1998). The last two mouse Pcdhα genes (Pcdhα-C1 and -C2) are highly similar to the last two human Pcdhα genes. Like the corresponding human genes, mouse Pcdhα-C1 and -C2 genes are more similar to each other than to the 12 upstream Pcdhα genes. Similar to the organization of the human Pcdhα constant region, the three small mouse Pcdhα constant region exons are located ∼10 kb downstream from the last variable region exon. The constant region exons of Pcdhα are highly conserved between mouse and human. Specifically, the nucleotide sequences of constant region exons 1, 2, and 3 are 92%, 99%, and 89% identical between mouse and human, respectively. Moreover, both human and mouse Pcdhα constant regions have two alternatively spliced forms (Sugino et al. 2000).

Although there is one fewer variable region exon in the mouse Pcdhα gene cluster than in the human cluster, the gene order is essentially conserved between mouse and human (Fig. 1B). However, the distance between some orthologous genes in mouse is very different from that in human. For example, the distance between mouse Pcdhα4 and Pcdhα5 genes is only 5 kb, in contrast to the large 12 kb intergenic region between the corresponding human genes. Three “relic” sequences were identified in the mouse Pcdhα gene cluster, and only one pseudogene was identified in the corresponding human cluster. Relics are defined as sequence fragments with only limited similarity to the corresponding functional genes (Rowen et al. 1996). In contrast, pseudogenes show more extensive sequence similarity but are rendered nonfunctional by mutations.

Figure 1 Comparison of the organization of mouse and human protocadherin gene clusters. Shown are the genomic organization of three closely linked mouse protocadherin gene clusters (A) and comparisons of the genomic organization of mouse and human Pcdhα/CNR (B), Pcdhβ (C), and Pcdhγ (D) gene clusters. The BAC clones used in the sequence analysis are shown below (A). The length of sequences between clusters is also shown in (A). Each gene family contains multiple tandem variable region exons indicated by a vertical color bar: (mauve) Pcdhα variable region exons; (turquoise) Pcdhβ genes; (orange) Pcdhγ-b variable region exons; (green) Pcdhγ-a variable region exons; (yellow) C-type Pcdh variable region exons (present in both the Pcdhα and Pcdhγ gene clusters); (blue) relic or pseudogene variable region sequences (present in all three gene clusters); (pink) constant region exons. Abbreviations: Pcdh, protocadherin; V, variable region; C, constant region; M, mouse; H, human; r, relic; ψ, pseudogene.

Comparison of the Organization of Human and Mouse Pcdhβ Gene Clusters

Sequence analysis of the genomic DNA downstream from the mouse Pcdhα gene cluster revealed a large exon located ∼77 kb downstream from the last Pcdhα constant region exon (Fig. 1A). This single large exon encodes an 818 aa protein containing a signal peptide, six typical protocadherin ectodomains, a transmembrane segment, and a short cytoplasmic domain. The encoded protein is highly similar to the human Pcdhβ1 protein: 88% identity and 92% similarity with no gaps over the entire length. Thus, we designated this gene mouse Pcdhβ1. Following the mouse Pcdhβ1 gene, there are 21 additional Pcdhβ genes that are more similar to the human Pcdhβ genes than to the human Pcdhα and Pcdhγ genes. We have therefore designated these genes mouse Pcdhβ2–Pcdhβ22 (Fig. 1C). We previously identified 15 Pcdhβ genes in the human Pcdhβ locus. We have now isolated a clone (CTD-2130B15) that covers the gap between the human Pcdhβ8 and Pcdhβ9 genes, and found that the gap sequence contains only one additional Pcdhβ gene (therefore designated Pcdhβ8a). Thus, mouse has six more Pcdhβ genes than human does (22 versus 16), and the Pcdhβ locus is expanded in mouse compared to that in human (Fig. 1C).

The predicted amino acid sequences of the mouse Pcdhβ proteins are more similar to each other than to the mouse Pcdhα or Pcdhγ proteins. The Pcdhβ proteins have highly conserved extracellular and transmembrane domains. The nucleotide and amino acid sequences in the region around the transmembrane domains of Pcdhβ proteins are almost identical, and these proteins have a very short cytoplasmic domain. In contrast to the Pcdhα and Pcdhγ gene clusters, neither mouse nor human Pcdhβ gene clusters contain constant region exons. Moreover, all of the Pcdhβ EST and cDNA clones currently in the GenBank database correspond to unspliced mRNAs. Therefore, Pcdhβ proteins do not appear to contain a common C-terminal intracellular domain. However, we noted that a highly conserved 5′ splice site is located at the end of most Pcdhβ variable region exons (Wu and Maniatis 1999), and this splice site is conserved between mouse and human (data not shown). Thus, it seems likely that the conserved Pcdhβ 5′ splice sites do function. However, neither the cell type in which this splicing occurs nor the target 3′ splice site has been identified.

Identification of Two Noncadherin Genes between the Pcdhβ and Pcdhγ Gene Clusters

Both the mouse and human protocadherin gene clusters are interrupted by two noncadherin-like genes located between the Pcdhβ and Pcdhγ gene clusters. The first gene is an ornithine transporter gene (ORNT2), and the second gene encodes a component (TAFII55) of the human TFIID complex (Fig. 1A). The coding regions of both genes are located on the strand opposite to the one that encodes the protocadherins. The mitochondrial ornithine transporter 1 (ORNT1) gene, which is defective in hyperornithinemia–hyperammonemia-homocitrullinuria syndrome, had been previously mapped to human chromosome 13q14 (Camacho et al. 1999). The human ORNT2 gene is a paralog of the human ORNT1 gene and has a full-length coding region. However, the corresponding mouse ORNT2 gene has a single nucleotide deletion near the 5′ end of the coding region. This single nucleotide deletion is not a consequence of sequencing error, because three genome-sequencing centers independently determined the same sequence. Thus, the mouse Ornt2 gene may be a pseudogene as a consequence of a very recent mutation. Alternatively, a second methionine codon located 107 nucleotides downstream from the first one may actually be the translational start codon. If so, the single nucleotide deletion in the mouse sequence would not inactivate the gene. Both human and mouse genes are transcribed because they have numerous EST matches in the database. The TAFII55 gene, which encodes a subunit of the TFIID complex (Chiang and Roeder 1995), consists of a single exon located between the Pcdhβ and Pcdhγ gene clusters in both mouse and human.

Figure 2 Alignments of variable region 5′ splice sites of mouse Pcdhα (A) and Pcdhγ (B) gene clusters. The 5′ splice site sequences are shown in bold, with the consensus below each panel.

Comparison of the Organization of Human and Mouse Pcdhγ Gene Clusters

DNA sequence analysis identified 22 mouse Pcdhγ variable region exons and three small constant region exons in the region downstream from the Pcdhβ gene cluster (Fig. 1A,D). One of the mouse Pcdhγ genes is identical to the previously cloned protocadherin 2C gene (Hirano et al. 1999a). Sequencing of cDNAs spanning the splice sites between variable and constant regions confirmed that cDNA fragments of all mouse Pcdhγ genes share an identical constant region sequence. Thus, each variable region exon is independently spliced to the first constant region exon. Comparison of the sequences of cDNAs with those of the genomic DNA identified a consensus splice site downstream from each variable region exon (Fig. 2B).
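As an illustration of how a consensus such as the one shown beneath each panel of Figure 2 can be derived from aligned 5′ splice sites, the following minimal Python sketch takes the most frequent base in each alignment column. The function name, the 50% majority cutoff, and the example sequences are assumptions made for illustration only; they are not the sequences or the procedure used in this study.

```python
from collections import Counter


def consensus(aligned_sites, min_fraction=0.5):
    """Return a consensus string for equal-length aligned sequences.

    A column yields its most common base when that base reaches
    min_fraction of the sequences; otherwise the column yields 'N'.
    """
    if not aligned_sites:
        raise ValueError("need at least one sequence")
    length = len(aligned_sites[0])
    if any(len(s) != length for s in aligned_sites):
        raise ValueError("sequences must be the same length")
    result = []
    for column in zip(*aligned_sites):
        base, count = Counter(b.upper() for b in column).most_common(1)[0]
        result.append(base if count / len(column) >= min_fraction else "N")
    return "".join(result)


# Hypothetical splice-site windows (placeholders, not data from Fig. 2).
example_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT"]
print(consensus(example_sites))
```

Columns in which no base reaches the cutoff are reported as N, so that weakly conserved positions are not mistaken for part of the consensus.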

The organization of the mouse Pcdhγ gene cluster is essentially the same as that of the human Pcdhγ gene cluster (Fig. 1D). Both have >20 variable region exons and both have three downstream constant region exons. The constant region exon sequences are highly conserved between mouse and human. Specifically, constant region exons 1, 2, and 3 have 95%, 90%, and 80% identity, respectively, between mouse and human at the nucleotide level. In addition, we found that each of the mouse Pcdhγ genes has the corresponding orthologous human gene except the mouse Pcdhγ-b8 gene, whose orthologous gene is the human Pcdhψ3 gene. Moreover, the mouse has a relic sequence at the location corresponding to the human Pcdhγ-b3 gene. Similar to the Pcdhα gene cluster, the last three Pcdhγ genes (C3, C4, and C5) are conserved between mouse and human (Fig. 1D). All five mouse C-type protocadherin genes, C1 and C2 in the Pcdhα cluster and C3, C4, and C5 in the Pcdhγ cluster, are similar to each other and are distinct from other members in the clusters.

Evolutionary Relationships among Members of the Human and Mouse Pcdhα, Pcdhβ, and Pcdhγ Genes

The proteins encoded by the protocadherin loci in human and mouse are highly similar. The evolutionary relationships between human and mouse Pcdhα genes are displayed in Figure 3A. The phylogenetic tree shows that most individual Pcdhα genes are orthologous between human and mouse. Thus, it is likely that each Pcdhα protein has a distinct, highly conserved function. However, the human Pcdhα7 and Pcdhα9, and the mouse Pcdhα7 and Pcdhα8 genes are paralogous, and the four genes are within a small branch in the tree. Therefore, the human Pcdhα7 and Pcdhα9, and the mouse Pcdhα7 and Pcdhα8 genes are probably the consequence of duplications of their respective ancestors after divergence of primates and rodents. Moreover, human Pcdhα6 and Pcdhα8 are paralogous, and there is a single orthologous mouse Pcdhα6 gene. This observation suggests that human Pcdhα6 and Pcdhα7, and Pcdhα8 and Pcdhα9 are duplicated from a single ancestral gene pair. The Pcdhα-c1 and Pcdhα-c2 variable regions are distinct from other Pcdhα proteins, and their high conservation between human and mouse strongly suggests that they have specific functions distinct from those of the other Pcdhα genes.

The evolutionary relationships between human and mouse Pcdhβ genes are displayed in Figure 3B. The human and mouse Pcdhβ genes display both orthologous and paralogous relationships. For example, the human Pcdhβ1, 2, 3, 6, 7, 13, 14, and 15 genes appear to be the orthologs of the mouse Pcdhβ1, 2, 3, 13, 15, 8, 20, and 22, respectively. However, three mouse Pcdhβ genes (5, 7, and 9) are paralogous and in a small branch with the human Pcdhβ4 gene, and six mouse Pcdhβ genes (4, 6, 8, 10, 11, and 12) are paralogous and in a small branch with a single human Pcdhβ5 gene. This observation suggests that the mouse Pcdhβ gene cluster expanded after the divergence of mouse and human.

In contrast to both Pcdhα and Pcdhβ genes, members of the Pcdhγ genes are strictly conserved between mouse and human. As shown in Figure 3C, each mouse gene and its human ortholog form a small branch in the phylogenetic tree. Therefore, members of the Pcdhγ gene cluster are orthologous between mouse and human. However, the mouse ortholog of the human Pcdhγ-b3 gene has degenerated into a relic sequence, and the human ortholog of mouse Pcdhγ-b8 has become a pseudogene (Fig. 1D).

The overall organization of the protocadherin gene clusters in mouse and human is essentially the same. First, both mouse and human have three protocadherin gene clusters, in the same order and orientation (Fig. 1). Second, the C-type protocadherin genes, the last two Pcdhα genes and the last three Pcdhγ genes, are more similar to each other, and are separated from corresponding upstream genes by a very large intergenic region (>40 kb) in both mouse and human (Fig. 1B and 1D). Third, the members of the Pcdhα and Pcdhγ gene clusters are strikingly conserved in both gene order (Fig. 1B,D) and gene sequences (Fig. 3A,C). Finally, the Pcdhα and Pcdhγ gene clusters have highly conserved constant region exons between mouse and human, whereas the Pcdhβ gene cluster does not have constant region exons in either mouse or human (Fig. 1).

Figure 3 Phylogenetic trees of human and mouse Pcdhα (A), Pcdhβ (B), and Pcdhγ (C) gene clusters. The trees were reconstructed using the neighbor-joining method of the PAUP program. The tree branches are labeled with the percentage support for that partition based on 1000 bootstrap replicates. Only bootstrap values of >50% are shown. The unrooted trees are rooted by midpoint prior to output.

Figure 4 Distribution of CpG islands in the genomic sequences of human and mouse protocadherin gene clusters. Shown are ratios of observed to expected CpG dinucleotide frequency of a 1000 bp sliding window in the region of human Pcdhα (A), Pcdhβ (B), and Pcdhγ (C) and mouse Pcdhα (D), Pcdhβ (E), and Pcdhγ (F) gene clusters. The peak of ratios correlates with the position of protocadherin variable region exons but not constant region exons. The position of each variable and constant region exon is indicated at the top of each panel. (CT), constant region exon.

The Distribution of CpG Islands Corresponds to the Locations of the Variable Region Exons

At present, it is not known whether each protocadherin gene cluster is transcribed from a single promoter, or whether each variable region exon has its own promoter. Insights into this problem could be provided by examining the sequences immediately surrounding each variable region in the mouse and human protocadherin gene clusters. One characteristic shared by ∼50% of mammalian promoters is the occurrence of CpG islands located near the 5′ ends of genes (Antequera and Bird 1993). Close examination of the sequences around the translation start sites of mouse and human protocadherin variable region exons revealed a high density of CpG dinucleotides, suggesting that they are CpG islands. Indeed, the sequences near the human Pcdhα2, Pcdhβ1, Pcdhγ-a10, and Pcdhγ-b3 translation start codons match four previously isolated CpG islands (Cross et al. 1994) (GenBank accession nos. Z65300, Z59266, Z60764, and Z58035, respectively).

We therefore searched the entire human and mouse gene clusters for CpG islands using the CpG-plot program (Larsen et al. 1992). As shown in Figure 4, the ratio of observed to expected CpG dinucleotide frequency peaks at the location of each variable region exon in both mouse and human. It is known that the mouse genome lost some CpG dinucleotides after the divergence of mouse and human (Antequera and Bird 1993). Consistent with this, we note that the ratio is slightly lower in mouse than in human (comparing Fig. 4A,B,C to 4D,E,F, respectively). Nevertheless, this distribution supports the proposal that each variable region exon has its own promoter and that a transcriptional start site is located upstream from each variable region exon.
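For readers who wish to reproduce a profile of the kind shown in Figure 4, the sketch below computes an observed-to-expected CpG ratio in a sliding window. It assumes the commonly used definition, ratio = (number of CpG × window length) / (number of C × number of G); the exact formula and parameters used by the CpG-plot program are not stated here, so the function names, the 100 bp step, and the treatment of windows lacking C or G are assumptions of this sketch.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for one sequence window.

    Assumed definition: (#CpG * N) / (#C * #G), where N is the window
    length. Windows with no C or no G return 0.0.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def sliding_cpg_ratio(seq, window=1000, step=100):
    """Yield (start, ratio) for every full window along the sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, cpg_obs_exp(seq[start:start + window])


# Toy usage on a short artificial sequence with small window settings.
for start, ratio in sliding_cpg_ratio("CG" * 10, window=10, step=5):
    print(start, ratio)
```

Plotting the ratio against the window start position and marking the variable region exon coordinates along the same axis would give a picture comparable in spirit to Figure 4.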

Noncoding Sequence Conservation Within the Variable Region of Mouse and Human Protocadherin Gene Clusters

We used the PipMaker program (Schwartz et al. 2000) to compare sequences of the entire mouse and human Pcdhα and Pcdhγ gene clusters (Fig. 5). Interestingly, the first two relics (r1 and r2) in the mouse Pcdhα gene cluster appear to result from interruption of an archaic protocadherin gene by repetitive elements (Fig. 5A). Although there are many conserved intergenic sequences in the protocadherin variable region, the most striking features are the occurrence of highly conserved sequences upstream of each variable region exon (Fig. 5A,B). For example, in the Pcdh� variable region, almost all conserved segments above 70% identity and longer than 100 base pairs (bp) are immediately upstream of variable coding regions.

A systematic analysis of these sequences revealed that the 5′ flanking sequences of orthologous variable region exons have a significantly higher percentage identity than the corresponding paralogous sequences within Pcdhα and Pcdhγ gene clusters in both mouse and human (Fig. 6A,B). In both the Pcdhα and Pcdhγ gene clusters, there is a peak of sequence identity at the region ∼200 bp upstream from the translation start codon. In contrast, a lower level of sequence identity, which is only slightly above the baseline for random sequences, is observed in the upstream sequences between the paralogous genes within either the Pcdhα or Pcdhγ gene cluster in both human and mouse (broken lines in Fig. 6A,B). We also observed that some variable region exons have a conserved element further upstream of the coding region. These results are consistent with the notion that there is a distinct promoter upstream of each variable region exon. The high level of sequence conservation upstream of variable region exons is in contrast to the sequences downstream from the variable region 5′ splice site, in which there is no conservation of sequences between the two species.

For the sequences upstream of the C-type protocadherin variable region exons, not only does each orthologous gene pair have a higher sequence identity than paralogous gene pairs, but also the conserved regions are much larger than those of other Pcdhα and Pcdhγ genes (Fig. 6C). Although there is no conserved segment above 70% identity and longer than 100 bp in the 5′ segment flanking the C1 protocadherin gene, there are five, three, three, and two such highly conserved segments upstream of the C2, C3, C4, and C5 genes, respectively. This observation suggests that the regulation of C-type protocadherins is different from that of other protocadherins.

Figure 5 Percent identity plot (PIP) of the Pcdhα (A) and Pcdhγ (B) genomic sequences between mouse and human by using the PipMaker program with the chaining option. The mouse genomic sequences are shown on the x-axis, and the percentage sequence identities (50%–100%) are shown on the y-axis. Annotation of the mouse sequences is illustrated at the top of the sequences by solid color boxes. The repeats of mouse sequence are depicted as follows: (black pointed boxes) LINE2s; (light gray pointed boxes) LINE1s; (dark gray pointed boxes) LTRs; (black triangles) MIRs; (light gray triangles) SINEs other than MIRs; (dark gray triangles) other repeats; (white boxes) simple repeats. Short yellow boxes are CpG islands where the ratio of CpG/GpC exceeds 0.75, and short green boxes are CpG islands where the ratio of CpG/GpC is between 0.60 and 0.75. (MDIA1) the last exon of mouse diaphanous gene 1.

Comparison of Protocadherin Constant Region Sequences

We noted previously that human Pcdhα constant region exons 1 and 2 are the same length as, and similar to, the corresponding Pcdhγ constant region exons (Wu and Maniatis 1999). The mouse Pcdhα constant region exons 1 and 2 are also the same length as the corresponding Pcdhγ constant region exons. The nucleotide sequence of mouse Pcdhα constant region exon 1 is 63% identical to that of the corresponding Pcdhγ constant region exon. The constant region exon sequences are also highly conserved between human and mouse. Specifically, the Pcdhα and Pcdhγ constant coding regions have 96% and 91% nucleotide identities between human and mouse, respectively, while the amino acid sequences are 99% identical for both Pcdhα and Pcdhγ constant regions. Therefore, the intracellular signal transduction pathway must be conserved between human and mouse.
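Amino acid identity can exceed nucleotide identity, as reported above, because synonymous substitutions change the DNA without changing the protein. The following Python sketch illustrates the effect on a toy pair of in-frame, ungapped coding sequences. The two sequences and the function names are hypothetical, and the standard genetic code is assumed.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two equal-length, ungapped sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical toy sequences differing only at two third-codon positions.
seq_1 = "CTGGAAAAG"
seq_2 = "CTCGAGAAG"
nt_identity = percent_identity(seq_1, seq_2)
aa_identity = percent_identity(translate(seq_1), translate(seq_2))
```

In this toy case the two DNA sequences differ at two of nine positions, yet both encode the same three residues.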

There are many conserved noncoding segments in the constant region of both the Pcdhα and Pcdhγ gene clusters, as shown by the PIP plots (Fig. 5A,B). The most prominent is a conserved sequence segment upstream of constant region exon 1 in both the Pcdhα and Pcdhγ gene clusters (Fig. 7A,B). Specifically, there is 83% sequence identity in a 200 bp intronic region and 83% sequence identity in a 300 bp intronic region upstream of Pcdhα and Pcdhγ constant region exon 1, respectively. These regions contain ∼50 continuous identical nucleotides between mouse and human (Fig. 7A,B). The functional significance of these highly conserved sequences remains to be established.

Identification of a DNA Sequence Motif Upstream of Protocadherin Variable Region Exons

Because the members of each protocadherin gene cluster are very similar to each other and upstream sequences are conserved between orthologous gene pairs in the Pcdhα and Pcdhγ gene clusters, we used a version of the Gibbs sampler program called GibbsDNA (Z. Ioschikz and M.Q. Zhang, unpubl.) to determine whether the upstream sequences share any motif. Strikingly, there is a highly conserved sequence motif upstream of nearly all variable region exons in each protocadherin gene cluster in both mouse and human (Fig. 8). The motif cannot be found in transcription factor binding site databases. Moreover, this motif is located at about the same distance from the translation start codon of each variable region exon (Fig. 8). In addition, we noted that there are several more nucleotides immediately upstream of this motif that appear to be conserved. We also noted that the distribution of motifs for C-type protocadherin genes differs from that of the other genes: only the first C-type genes in both clusters (C1 and C3 in the Pcdhα and Pcdhγ gene clusters, respectively) have the motif. Although human Pcdhγ-C4 has a weak motif, the orthologous mouse Pcdhγ-C4 does not have the motif. Interestingly, neither the human nor the mouse Pcdh�1 gene has the motif.

Figure 6 Upstream sequences of orthologous genes are more conserved than those of paralogous genes. The maximal sequence identities of all 100-bp segments within a 150-bp sliding window were computed for each gene pair. The x-axis represents the end position of the sliding window relative to the translation start codon. The y-axis represents the percentage sequence identities. Shown are the averages of the 100-bp-segment maximal identities of all orthologous gene pairs (solid lines with standard deviation) in the Pcdhα (A) and Pcdhγ (B) gene clusters. Also shown are the maximal identities between each gene and all the other paralogous members (excluding C-type protocadherin genes) of the same gene cluster (broken lines without standard deviation). The maximal identities for each orthologous gene pair in C-type protocadherin genes are shown individually in C. Note that the conserved region upstream of C-type protocadherin genes is larger than that of other protocadherin genes.

A careful examination of the motif from all three gene clusters revealed a common core sequence, “CGCT” (Fig. 8). Moreover, this core sequence is surrounded by additional conserved sequences that are specific for each gene cluster (Fig. 8). For example, in both human and mouse, a CC dinucleotide is found at fixed distances upstream and downstream from the core sequence in the Pcdhα gene cluster (Fig. 8A,D). Similarly, other cluster-specific sequences are found in the Pcdhβ and Pcdhγ gene clusters. This remarkable similarity of sequence motifs among genes within a cluster, and between the same clusters in human and mouse, together with their striking locations in the loci, strongly suggests that they are important for the regulation of protocadherin gene expression.
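As a simple illustration of how the position of the “CGCT” core could be checked in an upstream sequence, the Python sketch below scans the forward strand for exact matches and reports those falling between -290 and -150 relative to the translation start codon. This is not the Gibbs sampler analysis used here; the function name and the coordinate convention (the base immediately 5′ of the start codon is -1) are assumptions made for the example.

```python
def core_positions(upstream: str, core: str = "CGCT",
                   lo: int = -290, hi: int = -150) -> list:
    """Return start coordinates of exact core matches within [lo, hi].

    `upstream` must end immediately before the start codon, so its last
    base has coordinate -1 and its first base has coordinate -len(upstream).
    Only the forward strand is scanned (an assumption of this sketch).
    """
    upstream = upstream.upper()
    core = core.upper()
    n = len(upstream)
    hits = []
    idx = upstream.find(core)
    while idx != -1:
        pos = idx - n  # coordinate of the first base of the match
        if lo <= pos <= hi:
            hits.append(pos)
        idx = upstream.find(core, idx + 1)
    return hits
```

For a 300-bp upstream sequence with a single core starting 200 bp before the start codon, the function returns `[-200]`.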

DISCUSSION

Protocadherins are members of the cadherin superfamily of cell-adhesion proteins (Kohmura et al. 1998; Hirano et al. 1999b; Yoshida and Sugano 1999; Kim et al. 2000; Nollet et al. 2000; Wu and Maniatis 2000). A subset of these proteins, originally designated CNR (Pcdhα), has been shown to be expressed at synaptic junctions and to display distinct patterns of cell-specific expression in different regions of the brain (Kohmura et al. 1998). The human counterparts of the CNR proteins were recently shown to be part of a larger family of proteins encoded by a cluster of genes, designated Pcdhα. This cluster was found to be located upstream of two additional protocadherin gene clusters, designated Pcdhβ and Pcdhγ (Wu and Maniatis 1999). The striking immunoglobulin-like organization of these gene clusters suggested that novel mechanisms may be involved in the regulation of their cell-specific expression in the brain (Chun 1999; Shapiro and Colman 1999; Wu and Maniatis 1999; Yagi and Takeichi 2000). To gain insight into these mechanisms, we determined the complete DNA sequence of the corresponding mouse protocadherin gene clusters. We then performed a comparative sequence analysis to identify potential regulatory sequences involved in determining the cell-specific expression of individual variable region exons.

Interspecies comparative sequence analysis is a powerful tool for obtaining information on gene organization and regulation. To date, comparative sequencing studies have been carried out for relatively few chromosomal loci, and the conservation of noncoding sequences varies widely between different loci (Ansari-Lari et al. 1998; Jang et al. 1999; Endrizzi et al. 2000). For example, there is relatively little sequence conservation in the intergenic regions of mammalian globin gene clusters, or in the excision repair cross-complementing repair group 2 (ERCC2) regions between human and mouse (Lamerdin et al. 1996; Hardison et al. 1997). In contrast, there is a very high level of noncoding sequence identity (∼71%) within a 100-kb region of the human and mouse T-cell receptor gene clusters (Koop and Hood 1994). We have found that the DNA sequences immediately upstream of each variable region exon are highly conserved between mouse and human orthologs (Figs. 5 and 6). A striking example of this is the 90% sequence identity within 338 bp upstream of the mouse and human Pcdhγ-C3 variable region exons. Other highly conserved intergenic sequences were identified in the region between the last variable region exon and the first constant region exon. For example, one of the most conserved sequences is located approximately 500 bp upstream of the first constant region exon in both the Pcdhα and Pcdhγ gene clusters (Fig. 7).

Figure 7 Conserved sequences upstream from constant region exon 1 of the Pcdhα (A) and Pcdhγ (B) gene clusters. The identical nucleotides are shown by short vertical lines. The positions relative to the start nucleotide of constant region exon 1 are shown at the beginning and end of each sequence.

Figure 8 Alignment of the conserved sequence motif upstream of the protocadherin coding region. Shown are the conserved sequences and their positions relative to the translation start codon in the mouse Pcdhα (A), Pcdhβ (B), and Pcdhγ (C) and human Pcdhα (D), Pcdhβ (E), and Pcdhγ (F) gene clusters. The probability of finding the motif within -290 to -150 nucleotides upstream of the translation start codon is shown within parentheses at right. The consensus sequences are shown below each panel. The conserved nucleotides are shown with white letters on a black background. The core sequences are highlighted with yellow bold letters on a red background.

Although interspersed repeats are considered “junk” DNA sequences, recent studies have shown that some of them may be active in modifying the genome (Moran et al. 2000). The interspersed repeats occupy 41% and 36% of the genomic sequences in the protocadherin loci in mouse and human, respectively. This is much higher than the 30% reported for the human β T-cell receptor locus (Rowen et al. 1996). The number of short interspersed nucleotide elements (SINEs) is much higher than that of long interspersed nucleotide elements (LINEs), in contrast to the almost equal numbers of SINEs and LINEs in the human β T-cell receptor locus (Rowen et al. 1996) and the Bpa/Str region (Mallon et al. 2000). Interestingly, most of the variable region 5′ splice sites are immediately followed by repeat sequences.
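The proportion of a locus occupied by interspersed repeats can be estimated from a repeat-masked sequence. The Python sketch below is a minimal example that assumes masked bases have been replaced by N or X and that the unmasked sequence contains no ambiguous bases; the function name and the choice of mask characters are illustrative and do not describe how the percentages reported here were obtained.

```python
def masked_fraction(masked_seq: str, mask_chars: str = "NX") -> float:
    """Fraction of positions replaced by a mask character.

    Assumes every N or X in the input marks a masked repeat base,
    i.e. the original sequence had no gaps or ambiguity codes.
    """
    if not masked_seq:
        return 0.0
    seq = masked_seq.upper()
    masked = sum(seq.count(ch) for ch in mask_chars)
    return masked / len(seq)
```

A returned value of 0.41, for example, would correspond to repeats occupying 41% of the sequence.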

Remarkably, the large regions between the first C-type protocadherin gene (C1 and C3 in the Pcdhα and Pcdhγ gene clusters, respectively) and the other upstream variable region exons are almost entirely occupied by repeats in both mouse and human (Fig. 5). In contrast, the regions between the last C-type protocadherin gene (C2 and C5 in the Pcdhα and Pcdhγ gene clusters, respectively) and the first constant region exon contain relatively few repeats in both mouse and human. Instead, this region has a relatively high sequence conservation between mouse and human in both the Pcdhα and Pcdhγ gene clusters (Fig. 5A,B). The most conserved segments are shown in Figure 7, where long stretches of exact sequence identity are observed between mouse and human.

The bulk of the mammalian genome has a GC content of 40% and is poor in CpG dinucleotides, with only 25% of the CpG dinucleotide frequency expected from the GC content. However, there are regions of genomic DNA, known as CpG islands, that contain CpG dinucleotides at about their expected frequency (Antequera and Bird 1993). Both mouse and human genomic DNA of the protocadherin gene clusters have ∼41% GC content. However, the distribution of CpG dinucleotides is highly specific, as the ratio of observed to expected CpG dinucleotide frequency peaks at the location of each variable region exon (Fig. 4). It is usually assumed that each island identifies a gene, because the number of CpG islands that are not associated with genes is likely to be small (Antequera and Bird 1993).
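As a rough worked example of the arithmetic behind these figures, assume (this is an assumption for illustration, not a statement from the cited work) that C and G are equally frequent, so that a GC content of 40% gives f_C = f_G = 0.20. The expected CpG frequency and the depleted bulk-genome value then follow as:

```latex
\begin{align*}
f_{\mathrm{CpG}}^{\mathrm{exp}} &= f_C \, f_G = 0.20 \times 0.20 = 0.04, \\
f_{\mathrm{CpG}}^{\mathrm{obs}} &\approx 0.25 \times f_{\mathrm{CpG}}^{\mathrm{exp}} = 0.01 .
\end{align*}
```

Under this assumption, bulk genomic DNA carries about one CpG per 100 dinucleotide positions, whereas a CpG island approaches four per 100.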

In summary, we annotated the mouse protocadherin genomic DNA sequence and found that the overall genomic organization of the three protocadherin gene clusters is highly conserved between mouse and human. Moreover, we identified the orthologous mouse and human gene pairs in the Pcdhα and Pcdhγ gene clusters, and found that the number and order of Pcdhα and Pcdhγ genes are essentially conserved between mouse and human. We also found, however, that the mouse and human Pcdhβ genes display both orthologous and paralogous relationships, and the mouse Pcdhβ locus is larger and has six more genes than the human locus. Finally, we showed that the upstream sequences of each variable coding region are more conserved between orthologous than between paralogous genes. Within these upstream sequences, there is a conserved motif shared by almost all members of the three closely linked gene clusters. In addition, the distribution of CpG islands correlates with the locations of variable region exons. Taken together, these results strongly suggest that each protocadherin variable region exon has a distinct promoter.

METHODS

Mouse BAC Isolation and Sequencing

Nineteen PCR primer pairs were designed to screen a mouse BAC library (RPCI-23). The primer sequences are: ATCCCAAAATGGTGATGAAACTG and CGCTGGCAGAGGCCAAGATCA (length of product: 89 bp); CTCTGTGCACCTGGAGGAGGC and CTGGTGTTGCACTGGATACTGTT (89 bp); GAAGTGGCCAGGAATCCCAGC and CTCAGGGATGGAGTAGTGGATC (95 bp); CCACTGAAGGCCGACTGGGAAC and CTCTGGGACGGAGTAATGAAGC (101 bp); CTTCGGATGCAGACATCGGAAC and TCTTTAACACTAGTTGGAGTGG (120 bp); CGTCAGATGCAGATGTCGGTTC and AGCCCAAGAGGTTTCACCTGC (110 bp); ATCCGATGCAGATATCGGAGTC and CTTTAACACAAGGGATAACGAAG (120 bp); ATCTGATTTGGATATAGGAGCC and GAGCAACAAACGATGCTCTTGG (165 bp); CGGACATAGGAGAGAACGCTG and CCTTCTTTAATATAAGTGACGGTC (120 bp); CTAGAAGGCGCCTCTGATGCAG and AGTTTTCGAAGAACAAGCACTGG (140 bp); AAGAGACGGTTCCGGAAGACAG and AACGAGTACTGACAGCTTCTGC (110 bp); CAGAGTGGATCGAGTGCCCTTG and GGTCACCATCTACTGTGGCTAC (140 bp); CTGGCTGTCATTCCAACTTCTC and GTAGCCACAGTAGATGGTGACC (140 bp); CCAAGTCTCCTACACCATGCTC and GTGATGTGGGCATTGGAGCCTG (100 bp); CTGCATGGATGTGCAATCTGAG and CTCTCTGTTTCTTCCTCTATGG (200 bp); GCAGGCTATTAACTGACAGGTC and GAGAAAGATCAACAGAACTTGCC (120 bp); GTCCCAGAACTACCAATATGAG and AGGGTCATGGAGCTGAAGACTG (100 bp); AAATGTGCTGTGGTTGTAGAGG and ACAGCAACAACTGTCTCTTGTG (110 bp); GAAGGTATTTGAGCGTGATCTAG and CTTCTTCTAGTCAGTTTCAATCCAC (120 bp). A total of 21 mouse BAC clones were isolated, and their sizes were estimated by pulsed-field gel electrophoresis. The clones were digested with BstZ17I. The restriction map was assembled from the resulting fragments. Seven minimally overlapping mouse clones were selected for shotgun sequencing. The chromosomal locations of the selected clones were mapped by FISH. Draft sequences of these BAC clones were produced by the DOE Joint Genome Institute. The sequences of all the mouse BACs and four human gap-closing clones were finished by the sequencing group at the Stanford Human Genome Center. All of the other human clones were finished by the DOE Joint Genome Institute. The finished sequences for the mouse and human clones contain no gaps and are estimated to contain less than one error per 200,000 bp. The GenBank accession nos. for the mouse clones are AC020967, AC020968, AC020969, AC020971, AC020972, AC020973, and AC020974. The GenBank accession nos. for the human clones are AC005366, AC004776, AC005609, AC005618, AC005752, AC005754, AC008468, AC010223, AC025436, and AC074130. In addition, the sequences and quality scores for each base position can be found at http://www-shgc.stanford.edu/Seq/Status/doe.html.

Phylogenetic Analysis

The variable region coding sequences were translated, and the resulting polypeptides were aligned using the Pileup program of the GCG sequence analysis package (Genetics Computer Group 1999) with default parameters. A phylogenetic tree was reconstructed by using PAUP (Phylogenetic Analysis Using Parsimony), version 4.0.0 (Swofford et al. 1996), with distance as the optimality criterion. Gaps in the alignment were treated as missing. The robustness of the tree partitions was evaluated by bootstrap analysis with a neighbor-joining search.

Sequence Analysis

Annotation

The mouse protocadherin coding regions were annotated by using the BLAST program (Altschul et al. 1997) and the GCG (Genetics Computer Group 1999) sequence analysis package. The potential coding sequences were aligned by using the multiple sequence alignment program Pileup. The 5′ splice sites were identified manually by inspecting the alignment of mouse sequences to the corresponding human and mouse cDNA sequences. All of the variable region 5′ splice sites conform to the splice site consensus sequences and were verified by sequencing the cDNA fragments spanning splice junctions between variable and constant region exons. The putative translation start codon was determined by inspecting the translated signal peptide sequences.

Comparison

Human and mouse Pcdhα, Pcdhβ, and Pcdhγ genomic sequences were assembled from the finished sequences of the respective BAC clones using the Seqed program of the GCG package. The CpG island distribution was plotted by using the CpGplot program (Larsen et al. 1992). The repeats were masked with the use of the RepeatMasker program (A.F.A. Smit and P. Green, http://ftp.genome.washington.edu/RM/RepeatMasker.html). The masked mouse genomic sequences of the Pcdhα and Pcdhγ gene clusters were compared with the corresponding human genomic sequences by using PipMaker (Schwartz et al. 2000) on the Web server http://bio.cse.psu.edu/pipmaker/. We used the chaining option of PipMaker for the Pcdhα and Pcdhγ clusters because their gene orders are conserved.

We assigned human and mouse orthologous protocadherin gene pairs based on the phylogenetic trees (Fig. 3). In cases of paralogous relationships in the phylogenetic tree, we assigned orthologs based on highest sequence identity. To systematically compare the upstream sequences of orthologous and paralogous genes, the upstream sequences were extracted according to our annotation from RepeatMasked genomic sequences (without masking low-complexity DNA). For each orthologous and paralogous gene pair, the maximal sequence identity among all 100 bp segments within a 150 bp sliding window was calculated (any masked sequences were counted as mismatches in the calculation). In the case of human Pcdh�2 and Pcdhα-C1, the sliding window size was 250 bp. The maximal 100 bp-segment identity within each window was plotted against its end position relative to the translation start codon.
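The sliding-window calculation described above can be sketched in Python as follows. The details are assumptions made for illustration: both upstream sequences are taken to be the same length and to end just before the start codon, every 100-bp segment in one window is compared without gaps against every 100-bp segment in the corresponding window of the other sequence, and masked bases are written as N. The function names and the default step size are hypothetical.

```python
def segment_identity(a: str, b: str) -> float:
    """Percent identity of two equal-length segments.

    Only matching A/C/G/T count, so masked bases (N) are mismatches.
    """
    matches = sum(1 for x, y in zip(a, b) if x == y and x in "ACGT")
    return 100.0 * matches / len(a)


def max_segment_identity(win_a: str, win_b: str, seg: int = 100) -> float:
    """Best ungapped identity over all segment pairs drawn from two windows."""
    best = 0.0
    for i in range(len(win_a) - seg + 1):
        for j in range(len(win_b) - seg + 1):
            best = max(best, segment_identity(win_a[i:i + seg], win_b[j:j + seg]))
    return best


def sliding_profile(up_a: str, up_b: str, window: int = 150,
                    seg: int = 100, step: int = 1) -> list:
    """Return (window_end_position, maximal_identity) pairs.

    Positions are relative to the start codon: the last upstream base is -1.
    """
    if len(up_a) != len(up_b):
        raise ValueError("upstream sequences must have equal length")
    up_a, up_b = up_a.upper(), up_b.upper()
    n = len(up_a)
    profile = []
    for start in range(0, n - window + 1, step):
        end = start + window
        best = max_segment_identity(up_a[start:end], up_b[start:end], seg)
        profile.append((end - 1 - n, best))
    return profile
```

Setting `window=250` reproduces the larger window mentioned for the two exceptional genes; averaging the profiles over gene pairs would give curves of the kind summarized in Figure 6.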

We used a version of the Gibbs sampler program to identify the conserved sequence motifs upstream of all variable region exons within each protocadherin gene cluster. The program also calculates the probability (ranging from 0 to 1) of finding the motif within -290 to -150 nucleotides upstream of each variable region start codon.

ACKNOWLEDGMENTS

We thank E. Branscomb and T. Hawkins for supporting the DNA sequence determination of the mouse protocadherin gene clusters at the Joint Genome Institute. We also thank the Sequencing Group at the Stanford Human Genome Center for finishing the clones. We are grateful to W. Miller for advice on PIP analysis, and to S. Ribich, B. Tasic, P. Cramer, and C. Nabholz for discussion and critical comments on the manuscript. This work was supported by grants from the NIH to T.M. (GM42231), from the Cancer Research Fund of the Damon Runyon-Walter Winchell Foundation to Q.W. (DRG-1559), from the NIH to M.Q.Z. (HG01696), and from the DOE to J.-F.C. (DE-AC03–76SF00098) and to R.M.M. (DE-FC03–99ER62873).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

REFERENCES

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.

Ansari-Lari, M.A., Oeltjen, J.C., Schwartz, S., Zhang, Z., Muzny, D.M., Lu, J., Gorrell, J.H., Chinault, A.C., Belmont, J.W., Miller, W., et al. 1998. Comparative sequence analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic region in mouse chromosome 6. Genome Res. 8: 29–40.

Antequera, F. and Bird, A. 1993. Number of CpG islands and genes in human and mouse. Proc. Natl. Acad. Sci. 90: 11995–11999.

Bruses, J.L. 2000. Cadherin-mediated adhesion at the interneuronal synapse. Curr. Opin. Cell Biol. 12: 593–597.

Camacho, J.A., Obie, C., Biery, B., Goodman, B.K., Hu, C.A., Almashanu, S., Steel, G., Casey, R., Lambert, M., Mitchell, G.A., et al. 1999. Hyperornithinaemia-hyperammonaemia-homocitrullinuria syndrome is caused by mutations in a gene encoding a mitochondrial ornithine transporter. Nat. Genet. 22: 151–158.

Chiang, C.M. and Roeder, R.G. 1995. Cloning of an intrinsic human TFIID subunit that interacts with multiple transcriptional activators. Science 267: 531–536.

Chun, J. 1999. Developmental neurobiology: A genetic Cheshire cat? Curr. Biol. 9: R651–R654.

Cross, S.H., Charlton, J.A., Nan, X., and Bird, A.P. 1994. Purification of CpG islands using a methylated DNA binding column. Nat. Genet. 6: 236–244.

Dreyer, W.J. and Roman-Dreyer, J. 1999. Cell-surface area codes: Mobile-element related gene switches generate precise and heritable cell-surface displays of address molecules that are used for constructing embryos. Genetica 107: 249–259.

Endrizzi, M.G., Hadinoto, V., Growney, J.D., Miller, W., and Dietrich, W.F. 2000. Genomic sequence analysis of the mouse naip gene array. Genome Res. 10: 1095–1102.

Genetics Computer Group. 1999. Program Manual for the Wisconsin Package Version 10.0. Genetics Computer Group (GCG), Madison, WI.

Gumbiner, B.M. 2000. Regulation of cadherin adhesive activity. J. Cell Biol. 148: 399–404.

Hagler, D.J., Jr. and Goda, Y. 1998. Synaptic adhesion: The building blocks of memory? Neuron 20: 1059–1062.

Hardison, R.C., Oeltjen, J., and Miller, W. 1997. Long human-mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome. Genome Res. 7: 959–966.

Hirano, S., Ono, T., Yan, Q., Wang, X., Sonta, S., and Suzuki, S.T. 1999a. Protocadherin 2C: A new member of the protocadherin 2 subfamily expressed in a redundant manner with OL-protocadherin in the developing brain. Biochem. Biophys. Res. Commun. 260: 641–645.

Hirano, S., Yan, Q., and Suzuki, S.T. 1999b. Expression of a novel protocadherin, OL-protocadherin, in a subset of functional systems of the developing mouse brain. J. Neurosci. 19: 995–1005.

Jang, W., Hua, A., Spilson, S.V., Miller, W., Roe, B.A., and Meisler, M.H. 1999. Comparative sequence of human and mouse BAC clones from the mnd2 region of chromosome 2p13. Genome Res. 9: 53–61.

Kim, S.H., Jen, W.C., De Robertis, E.M., and Kintner, C. 2000. The protocadherin PAPC establishes segmental boundaries during somitogenesis in Xenopus embryos. Curr. Biol. 10: 821–830.

Kohmura, N., Senzaki, K., Hamada, S., Kai, N., Yasuda, R., Watanabe, M., Ishii, H., Yasuda, M., Mishina, M., and Yagi, T. 1998. Diversity revealed by a novel family of cadherins expressed in neurons at a synaptic complex. Neuron 20: 1137–1151.

Koop, B.F. and Hood, L. 1994. Striking sequence similarity over almost 100 kilobases of human and mouse T-cell receptor DNA. Nat. Genet. 7: 48–53.

Lamerdin, J.E., Stilwagen, S.A., Ramirez, M.H., Stubbs, L., and Carrano, A.V. 1996. Sequence analysis of the ERCC2 gene regions in human, mouse, and hamster reveals three linked genes. Genomics 34: 399–409.

Larsen, F., Gundersen, G., Lopez, R., and Prydz, H. 1992. CpG islands as gene markers in the human genome. Genomics 13: 1095–1107.

Lynch, E.D., Lee, M.K., Morrow, J.E., Welcsh, P.L., Leon, P.E., and King, M.C. 1997. Nonsyndromic deafness DFNA1 associated with mutation of a human homolog of the Drosophila gene diaphanous. Science 278: 1315–1318.

Mallon, A.M., Platzer, M., Bate, R., Gloeckner, G., Botcherby, M.R., Nordsiek, G., Strivens, M.A., Kioschis, P., Dangel, A., Cunningham, D., et al. 2000. Comparative genome sequence analysis of the Bpa/Str region in mouse and man. Genome Res. 10: 758–775.

Moran, J.V., DeBerardinis, R.J., and Kazazian, H.H., Jr. 2000. Exon shuffling by L1 retrotransposition. Science 283: 1530–1534.

Nollet, F., Kools, P., and van Roy, F. 2000. Phylogenetic analysis of the cadherin superfamily allows identification of six major subfamilies besides several solitary members. J. Mol. Biol. 299: 551–572.

O’Hanlon, T.P., Raben, N., and Miller, F.W. 1995. A novel gene oriented in a head-to-head configuration with the human histidyl-tRNA synthetase (HRS) gene encodes an mRNA that predicts a polypeptide homologous to HRS. Biochem. Biophys. Res. Commun. 210: 556–566.

Redies, C. 2000. Cadherins in the central nervous system. Prog. Neurobiol. 61: 611–648.

Rowen, L., Koop, B.F., and Hood, L. 1996. The complete 685-kilobase DNA sequence of the human β T-cell receptor locus. Science 272: 1755–1762.

Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W. 2000. PipMaker–A web server for aligning two genomic DNA sequences. Genome Res. 10: 577–586.

Serafini, T. 1999. Finding a partner in a crowd: Neuronal diversity and synaptogenesis. Cell 98: 133–136.

Shapiro, L. and Colman, D.R. 1999. The diversity of cadherins and implications for a synaptic adhesive code in the CNS. Neuron 23: 427–430.

Steinberg, M.S. and McNutt, P.M. 1999. Cadherins and their connections: Adhesion junctions have broader functions. Curr. Opin. Cell Biol. 11: 554–560.

Sugino, H., Hamada, S., Yasuda, R., Tuji, A., Matsuda, Y., Fujita, M., and Yagi, T. 2000. Genomic organization of the family of CNR cadherin genes in mice and humans. Genomics 63: 75–87.

Suzuki, S.T. 1996. Protocadherins and diversity of the cadherin superfamily. J. Cell Sci. 109: 2609–2611.

Swofford, D.L., Olsen, G.J., Waddell, P.J., and Hillis, D.M. 1996. Phylogenetic inference. In Molecular systematics (eds. D.M. Hillis et al.), pp. 407–514. Sinauer Associates, Sunderland, Massachusetts.

Wu, Q. and Maniatis, T. 1999. A striking organization of a large family of human neural cadherin-like cell adhesion genes. Cell 97: 779–790.

Wu, Q. and Maniatis, T. 2000. Large exons encoding multiple ectodomains are a characteristic feature of protocadherin genes. Proc. Natl. Acad. Sci. 97: 3124–3129.

Yagi, T. and Takeichi, M. 2000. Cadherin superfamily genes: Functions, genomic organization, and neurologic diversity. Genes & Dev. 14: 1169–1180.

Yamagata, K., Andreasson, K.I., Sugiura, H., Maru, E., Dominique, M., Irie, Y., Miki, N., Hayashi, Y., Yoshioka, M., Kaneko, K., et al. 1999. Arcadlin is a neural activity-regulated cadherin involved in long term potentiation. J. Biol. Chem. 274: 19473–19479.

Yoshida, K. and Sugano, S. 1999. Identification of a novel protocadherin gene (PCDH11) on the human XY homology region in Xq21.3. Genomics 62: 540–543.

Received October 16, 2000; accepted in revised form January 9, 2001.
