
The American Journal of Human Genetics (AJHG), Vol. 90, No. 4, 2012


DESCRIPTION

Issue of The American Journal of Human Genetics, Volume 90, No. 4, 2012


EDITORS’ CORNER

This Month in The Journal

Sara B. Cullinan1

Genomic Privacy in GWAS?

Im et al., page 591

Recent technological advances have made it possible to

interrogate human phenotypes at a previously unimagin-

able scale. But, as with any collection of personal data, it

is important to ensure individual privacy. Indeed, previous

investigations into the ability to discern an individual’s

participation in genetic studies have led to the withdrawal

of allele frequencies from publicly available results. In this

issue, Im et al. probe deeper, questioning how much

private information can be extracted from typically re-

ported statistics, such as regression coefficients or p values.

Through a series of analyses, the authors determine that

regression coefficients can, in some cases, provide just as

much information as allele frequencies, thus creating a

situation in which even statistics that were thought to be

“safe” can in fact identify participants and their medical

history. The possibility of membership detection is espe-

cially high in cases in which multiple phenotypes are

being reported, e.g., in multiple-omics data sets. With

exome- and whole-genome sequencing (and the large

data sets that they generate) becoming more common, it

is clear that many additional discussions between scien-

tists, clinicians, and ethicists are needed to ensure that

privacy can be maintained without sacrificing the dissem-

ination of research findings.

A Major mtDNA Shake-Up

Behar et al., page 675

In 1981, the revised Cambridge Reference Sequence was

published. It immediately became the standard against

which human mtDNA is compared and phylogenies are

derived. Indeed, its publication enabled a tremendous

amount of research aimed at better understanding human

history. However, the realization that this sequence belongs

to a recently coalescing European haplogroup creates

several concerns about inconsistencies and misinterpreta-

tion. To address these concerns, Behar et al. set out to reas-

sess and refine the human mtDNA phylogeny, and in so

doing, they constructed a new reference mtDNA sequence,

termed the Reconstructed Sapiens Reference Sequence

(RSRS). Generated through the assessment of over 18,000

human mtDNA sequences, as well as those of Homo

neanderthalensis, the RSRS performs well in molecular clock

analyses and lays the groundwork for a new way of ana-

lyzing mtDNA. Although this change will require a large

amount of rethinking, the authors put forth a coherent

plan to make this feasible, including tools to transform

previously generated data and analyses. With the amount

of deep-sequencing data that should become available in

the coming years, the RSRS presents a “next-generation”

approach to understanding human matrilineal diversity.

First Steps toward Understanding Birth Weight

Ishida et al., page 715

Babies come in many different sizes, but being too small

is a major health concern. Indeed, intrauterine growth

restriction (IUGR) serves as a risk factor for several adult

diseases, including obesity and type 2 diabetes. Although

maternal health plays a large role in directing fetal growth,

the genetic factors that contribute to the variability in fetal

size remain poorly understood. Of interest, however, are

those genes that undergo imprinting, a process by which

the parent of origin determines monoallelic expression.

Evolutionary theory posits that expression of alleles inherited

from the father promotes in utero growth, whereas

those inherited from the mother inhibit growth. But what

happens if the maternally inherited allele exhibits an

altered expression pattern? Might the balance be tipped?

In this issue, Ishida et al. explored the possibility that

variants in PHLDA2, which is only expressed from the

maternal allele, might influence birth weight. Their studies

identified a variant in the PHLDA2 promoter region that

eliminates several consensus transcription factor binding

sites and should therefore lead to decreased expression.

Then, through a cross-sectional study of normal births,

they showed that inheritance of this variant (from the

mother), as well as maternal homozygosity, correlated

with increased birth weight. Future studies, focused specif-

ically on IUGR, should help to elucidate how variation in

PHLDA2, and potentially in other imprinted genes, con-

tributes to the regulation of birth weight and related

complications.

Evolutionary History of AD Risk Alleles

Raj et al., page 720

Alzheimer’s disease (AD) is the most common neuro-

degenerative disease, and as of yet, there are no effective

1Deputy Editor, AJHG

DOI 10.1016/j.ajhg.2012.03.008. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 575–576, April 6, 2012 575

treatments, let alone a cure. Therefore, there is great

interest in better understanding the causes of the disease

from both biochemical and genetic standpoints. The

best-characterized genetic risk factor is the ε4 haplotype

of APOE, which, interestingly, shows evidence of having

undergone positive selection, most likely because of an

effect on an unrelated phenotype. With this in mind, Raj

et al. set out to identify other possible indications of selec-

tion in loci shown to associate with AD susceptibility. They

found such evidence, all in East Asian populations, for

three loci, suggesting that the same selective pressure

might have acted on each. Given that AD is unlikely to

serve in such a role, the authors posited that pathogen

exposure might have been the driving force. Indeed,

many signatures of selection in the human genome are

attributed to interactions with pathogens. Interestingly,

the protein products generated at these loci appear to

belong to the same interaction network. This finding

suggests that additional clues about AD risk might be

found by interrogating other branches of this network.

Although much remains to be learned about the variants

that contribute to AD risk, the study of their evolution,

and possible coevolution, will no doubt yield insights

into the underlying biology of the disease.

X Marks the Spot in Breast Cancer Research

Park et al., page 734

The ubiquitous pink ribbons serve as a reminder that

many women (and some men) are affected by breast

cancer. Although well known, BRCA1 and BRCA2 muta-

tions account for a minority of hereditary cancers. There-

fore, a better understanding of the biology of breast

cancer, along with better screening tests, is sought by

many families. To help achieve these goals, Park et al.

used exome sequencing and identified rare mutations in

XRCC2 that serve as susceptibility factors for familial

breast cancer. XRCC2 is a RAD51 paralog that is required

for efficient homologous recombination (HR); its loss

leads to marked genome instability and aneuploidy.

Future studies aimed at delineating the exact role of

XRCC2 mutations, as well as mutations that lie within

the same pathway, in disease onset and/or progression

should aid in the discovery of new treatment options.

This finding adds to the list of genes whose protein prod-

ucts perform crucial roles in HR and whose mutations can

influence breast cancer risk. It also provides support for

those who seek to better understand common diseases

through sequencing studies.


EDITORS’ CORNER

This Month in Genetics

Kathryn B. Garber1,*

Big Gene, Big Heart

Although the cardiomyopathies have a substantial genetic

etiology, genetic testing for this class of heart disorders has

been notoriously difficult. Indeed, the causative mutation

is found in only 20%–30% of patients with dilated cardio-

myopathy. Titin is a candidate gene for cardiomyopathy

that has been examined for mutations to a limited extent

due to its massive coding sequence, which is ~100 kb

in size. Herman et al. recently published data showing

that the sequence hurdle for this gene is worth the effort.

Through next-generation sequencing, they identified

a truncating TTN mutation in ~25% of familial cases of

idiopathic dilated cardiomyopathy, moving TTN to the

forefront of genes involved in this form of the disease.

Although these mutations had very high penetrance after

age 40 in familial cases, there is also a significant amount

of TTN variation whose clinical significance is difficult to

interpret at this time. This includes missense variation,

which was not analyzed in this current paper, so its role

in cardiomyopathy is unclear. Even with truncating muta-

tions in TTN, interpretation is not always simple; these

mutations were identified, albeit at lower frequency, in

control individuals and in individuals with hypertrophic

cardiomyopathy who also had a pathogenic mutation in

a known disease gene.

Herman et al. (2012) NEJM 366, 619–628.

A Complex Balance

Perhaps it is not surprising that the more closely you look

at something, the more you see. Certainly, the advent of

whole-genome comparative genomic hybridization

(CGH) arrays taught us that many people with normal

G-banded karyotypes have cytogenetic aberrations when

we look more closely. Even high-resolution CGH arrays

don’t give us a complete picture of chromosomes, as

recently illustrated by Chiang et al. These investigators

took a set of individuals who had apparently balanced

chromosome translocations—at least based on G-banding

and whole-genome CGH arrays—and they analyzed the

breakpoints at the nucleotide level. What they found was

an unexpectedly high level of complexity to the break-

points. In almost 20% of cases, three or more breakpoints

were involved, but in some cases, a shockingly complex

interweaving of segments occurred, akin to what was

recently described in cancer cells as “chromothripsis,” or

chromosome shattering and reorganization. The cases

analyzed by Chiang et al. involved upward of ten break-

points with inverted segments interspersed among seg-

ments of the expected orientation. This phenomenon is

not limited to spontaneous rearrangements in humans;

analysis of transgene insertions in mice and in sheep

revealed that the sites of integration can be similarly

complex.

Chiang et al. (2012) Nat. Genet. Published online March 4,

2012. 10.1038/ng.2202.

Good News for Men

The Y chromosome is just a degenerate of its former auto-

somal self that is on its way to extinction, or so some have

proposed. If you compare the Y to the X chromosome, for

instance, the Y has lost many of the genes that the

chromosomes once shared, and without a companion

chromosome with which to fully pair itself during meiosis,

some think this sex-specific chromosome is doomed.

David Page argues otherwise. His group does species

comparisons of the Y chromosome in order to understand

its evolution and to better predict the future fate of the Y.

Page’s group previously compared the human to the chim-

panzee Y chromosome, which diverged about six million

years ago, but, in order to look at a much longer evolu-

tionary window, his group recently compared the human

and rhesus macaque Y chromosomes, which diverged

25 million years ago. This comparison yielded a surprising

level of evolutionary stability on the Y. In the majority of

the male-specific regions of the Y chromosome, rhesus

macaques and humans share the same ancestral genes,

arguing for Y chromosome stability over the long haul.

In only a very restricted segment of the Y has gene loss

occurred in humans since the split from the Old World

monkeys. Their data fit a model in which rapid degenera-

tion of segments on Y was followed by marked slowing

of this decay and chromosome stabilization. Don’t count

the Y out just yet; it looks like it may stick around a while.

Hughes et al. (2012) Nature 483, 82–86.

Enhancers Acting as Promoters

Just as we learn to group letters into words and bin words

into different parts of speech in order to extract meaning

from sentences, we try to interpret genome sequences by

picking out the nucleotide sets that comprise genes and

attempting to recognize the regulatory elements from

strings of As, Cs, Gs, and Ts. But although we might think

1Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.03.009. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 577–578, April 6, 2012 577

we understand what a particular type of genetic element

does, recognition of one of its roles in gene expression

sometimes doesn’t tell the whole story. Take enhancers,

for instance. These are well-studied cis elements that

have a simple job: they bind transcription factors and

enhance expression from gene promoters, hence their

name. Kowalczyk et al. wondered whether that’s all

enhancers do, and they ended up with evidence that intra-

genic enhancers can also act as alternative tissue-specific

promoters. The resulting mRNAs are spliced and polyade-

nylated but do not appear to be translated into protein.

Because enhancers are much more common than classic

promoters and because about half of enhancers are intra-

genic, this promoter-like activity could contribute substan-

tially to the complexity of the mammalian transcriptome.

The next step is to figure out how these untranslated tran-

scripts are used.

Kowalczyk et al. (2012) Mol. Cell 45, 447–458.

A Common Turn-On

While we’re on the subject of surprising roles for noncod-

ing elements, a recent paper uncovered the coordinated

regulation of two neighboring, but nonparalogous, genes

that both tie into an identical phenotype. Joe Gleeson’s

group focuses on ciliopathies, and they recently identified

mutations in TMEM216 at the JBTS2 locus that cause Jou-

bert syndrome. Of the ten JBTS2-linked families, however,

only about half of them had a TMEM216 mutation, despite

an identical phenotype to the mutation-containing

families. When they resequenced the JBTS2 locus, they

found mutations in a neighboring gene, TMEM138, that

is not related to TMEM216, although it also encodes a

transmembrane protein. Although your first thought

might be that TMEM138 simply contains a regulatory

element for TMEM216, this is not the case. Rather, both

genes are coordinately expressed via the action of an inter-

genic element, and they both encode proteins involved

in the same process, ciliogenesis. Knockdown of either

protein leads to defective ciliogenesis, which ultimately

is central to the Joubert syndrome phenotype. Thus,

despite the fact that the genes are very different, they

have evolved a system of coordinated regulation and func-

tional relatedness.

Lee et al. (2012) Science 335, 966–969.

This Month in Our Sister Journal

Yeast System for Characterization of Cystathionine-

Beta-Synthase Mutations

Although we know that individuals with deficiency of

cystathionine-beta-synthase (CBS) tend to have intellec-

tual disability, a marfanoid habitus, ectopia lentis, and

increased risk of thromboembolism, there is variable

expressivity for this disorder, and it is difficult to predict

outcome from genotype. Dietary protein and methionine

restriction is the central approach to management, and

supplementation with vitamin B6, a cofactor of CBS, can

lead to further reductions in homocystine levels in some

affected individuals, who tend to have milder disease. To

address the challenge of genotype-phenotype correlations

in CBS deficiency, Mayfield et al. used a yeast system to

characterize the function of all 84 CBS missense alleles

that had been documented as of 2010. This system, in

which the yeast ortholog of CBS is replaced by human

alleles, allows them to assess the general level of function,

as well as the responsiveness of each allele to vitamin B6

and to another cofactor, heme. The authors also propose

that glutathione deficiency should be further explored in

the context of CBS deficiency, because they noted reduced

glutathione production in their system when CBS function

was disabled.

Mayfield et al. (2012) Genetics. Published online January 20,

2012. 10.1534/genetics.111.137471.


REVIEW

Fragile X and X-Linked Intellectual Disability: Four Decades of Discovery

Herbert A. Lubs,1 Roger E. Stevenson,1,* and Charles E. Schwartz1

X-Linked intellectual disability (XLID) accounts for 5%–10% of

intellectual disability in males. Over 150 syndromes, the most

common of which is the fragile X syndrome, have been described.

A large number of families with nonsyndromal XLID, 95 of which

have been regionally mapped, have been described as well. Muta-

tions in 102 X-linked genes have been associated with 81 of these

XLID syndromes and with 35 of the regionally mapped families

with nonsyndromal XLID. Identification of these genes has

enabled considerable reclassification and better understanding of

the biological basis of XLID. At the same time, it has improved

the clinical diagnosis of XLID and allowed for carrier detection

and prevention strategies through gamete donation, prenatal

diagnosis, and genetic counseling. Progress in delineating XLID

has far outpaced the efforts to understand the genetic basis for

autosomal intellectual disability. In large measure, this has been

because of the relative ease of identifying families with XLID

and finding the responsible mutations, as well as the determined

and interactive efforts of a small group of researchers worldwide.

Introduction

Mutations resulting in X-linked intellectual disability

(XLID) have been described in 102 genes (Table S1, avail-

able online).1 This work was accomplished over a 40 year

period during which the term X-linked mental retardation

was widely used; however, we will use intellectual

disability (ID), which is emerging as the preferred termi-

nology. Mutations in these 102 genes are responsible for

81 of the known 160 XLID syndromes and over 50 families

with nonsyndromal XLID (Table S1 and Figures 1 and 2).

An additional 30 XLID syndromes and 48 families with

nonsyndromal XLID have been regionally mapped (Table

1 and Figures 2 and 3), but the genes have not yet been identified.

Forty-four XLID syndromes, which remain unmapped,

have also been described (Table S2). Fewer than 400 auto-

somal genes in which mutations resulted in ID have

been identified. Of 1,640 references to ID in OMIM (as of

March 2010), 316 are entities on the X chromosome. Three

comparably sized chromosomes (6, 7, and 8) show 50, 58,

and 60 references, respectively. Several authors have

recently discussed the possibility that these striking differ-

ences might result from a relative concentration of genes

that influence intelligence on the X chromosome.2,3

Identification of the mutations in 102 genes that cause

XLID has been accomplished primarily through long-

term, planned and coordinated studies from the United

States, Europe, and Australia. These studies took advantage

of the power of pedigrees of relatively large families to

assign putative genes to the X chromosome, linkage anal-

ysis to achieve regional localizations, accumulation and

sharing of large data banks of clinical details and speci-

mens, registries of pertinent X chromosomal transloca-

tions and abnormalities, stored samples from a variety of

populations around the world with ID, and effective

communication between numerous investigators. In this

setting, the continuously developing technologies were

applied and reapplied to the available clinical and spec-

imen banks effectively and rapidly. A comparable system-

atic approach to autosomal ID has not been carried out.

Publication of the first family with the marker X,4 later

renamed the fragile X (MIM 300624),5 gave an important

impetus to the field by providing a laboratory tool

which clearly identified the most prevalent XLID syn-

drome. A series of biennial international meetings on

fragile X syndrome and XLID, beginning in 1983, involved

about 100 investigators and provided a sense of unity and

progress to the field. Papers and abstracts from these meet-

ings and from other research were published (usually bien-

nially) as conference reports, special issues or updates on

XLID from 1984 to 2008.6–16

The focus of this review will be the discovery process

rather than the details of the clinical or molecular findings

in the individual XLID entities. Readers are referred to the

recently updated excellent review of the fragile X in OMIM

(MIM 300624) and OMIM entries on other XLID disorders

as detailed in Tables S1 and S2. Other reviews of different

aspects of XLID include the periodic XLID updates from

1984 to 2008, an Atlas of XLID Syndromes,1 and a number

of commentaries by individual investigators.3,17–22

XLID before Fragile X

The prelude to the current cytogenetic and molecular era

covered a century (1868–1968). It encompassed descrip-

tions of a number of clinically defined entities (Pelizaeus-

Merzbacher disease [MIM 312080], Duchenne muscular

dystrophy [MIM 310200], incontinentia pigmenti [MIM

308300], Goltz focal dermal hypoplasia [MIM 305600],

Lenz microphthalmia syndrome [MIM 309800]), inborn

errors of metabolism (Hunter syndrome [MIM 309900],

Lowe syndrome [MIM 309000], Lesch-Nyhan syndrome

[MIM 300322]), and large pedigrees in which ID segregated

with an X-linked pattern.23–28 During the same period, the

excess of males among persons with ID was observed in

1Greenwood Genetic Center, JC Self Research Institute of Human Genetics, 113 Gregor Mendel Circle, Greenwood, SC 29646, USA

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.018. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 579–590, April 6, 2012 579

census surveys and other population studies.29–31 The

magnitude of the male excess varied from study to study

but averaged about 30 percent and was found in nearly

all studies.

These two observations—the excess of males among

persons with ID and clinical syndromes or families with

ID that segregated with an X-linked pattern—provided

compelling evidence that genes on the X chromosome

were important contributors to the overall causation of

ID and, hence, of individual, familial, and societal signifi-

cance. By virtue of having but a single X chromosome,

the male’s genome was uniquely vulnerable and that

vulnerability extended to brain development and function

as well as to other systems.

Further insights during this early period of time were

that XLID comprised syndromal entities (ID plus somatic,

metabolic, or neuromuscular manifestations) and nonsyn-

dromal entities (ID alone or with inconsistent abnormali-

ties). It also became clear that some females in XLID

pedigrees had intellectual limitations, albeit with neither

the consistency nor the severity of males. Technological

limitations (lack of tools for linkage analysis and gene

isolation) precluded a more precise genetic characteriza-

tion of XLID disorders and delayed the clinical delineation.

The Setting of the Initial Observation of the Marker X

In 1966, when a one-year-old boy and his brother were

referred to the Yale chromosome laboratory for study

because of delayed development, medical cytogenetics

was in a period of transition. The major trisomies as well

as translocations and large deletions had been defined by

nonspecific orcein or Giemsa staining. Prenatal cytoge-

netic diagnosis had begun and in order to provide more

predictive developmental information to families, there

was a need for both better, less biased clinical information

about X and Y aneuploidy and the several types of smaller

variations in the short arms of the acrocentric chromo-

somes and variant heterochromatic regions on 1, 9, 16,

and Y. The Yale laboratory had selected a minimal medium

(199) for both routine diagnostic studies and for a year-

long study of 4,500 consecutive cord blood and 500

maternal samples. Special attention was given to breaks,

gaps, and chromosome variants in the year-long study.

The study also sought to identify cytogenetic markers

Figure 1. Genes with Identified Mutations that Cause Syndromal XLID with Chromosomal Band Location


that might correlate directly with clinical conditions.32

Thus, the initial observation that the two brothers referred

to the laboratory because of ID had a consistent chromatid

break or constriction in the distal long arm of a large C

group chromosome was very pertinent to the research

goals of the laboratory. Further study revealed that their

normal mother and two maternal relatives with ID (an

uncle and great uncle of the boys) had the same marker

X chromosome.

The pedigree was, of course, consistent with X-linked ID.

Studies with H3 thymidine showed that the late repli-

cating, large C group chromosome was the same as the

chromosome with the apparent breaks and secondary

constrictions. The data led to the conclusion that “either

the secondary constriction itself or a closely linked

recessive gene may account for the pattern of X-linked

inheritance”.4 This was, in fact, probably the first precise

localization of a gene associated with human disease. The

fragile X locus was subsequently defined as an uncoiled

region (secondary constriction) by electron microscopy.33

Studies from a number of laboratories would provide a

more precise confirmation and molecular characterization

Figure 2. Location of Genes with Mutations that Cause Nonsyndromal XLID

Twenty-two genes shown on the left of the chromosome with solid arrows cause nonsyndromal XLID only. Numbers in parentheses adjacent to the gene symbols are assigned MRX numbers. Seventeen genes shown on the right of the chromosome with open arrows cause both syndromal and nonsyndromal XLID. *MRX64 is due to a dup MECP2. **MRX17 and MRX31 are due to dup HUWE1 and 2 adjacent genes.

Table 1. Nonsyndromal XLID Families (MRX1–MRX95) with Linkage or Gene Identification (a)

MRX Gene or region   MRX Gene or region       MRX Gene or region
1   IQSEC2           33  ARX                  65  Xp11.3-q21.33
2   Xp22.1-p22.3     34  del IL1RAPL1         66  Xq21.33-q23
3   HCFC1            35  Xq21.3-q26           67  Xq13.1-q21.31
4   Xp11.22-q21.31   36  ARX                  68  ACSL4
5   Xp21.1-q21.3     37  Xp22.31-p22.32       69  Xp11.21-q22.1
6   Xq27             38  ARX                  70  Xq23-q25
7   Xp11.23-q12      39  Xp11                 71  Xq24-q27.1
8   DLG3             40  Xq21                 72  RAB39B
9   FTSJ1            41  GDI1                 73  Xp22-p21
10  Xp11.4-p21.3     42  Xp11.3-q13.1; Xq26   74  Xp11.3-p11.4
11  Xp11.22-p21.3    43  ARX                  75  Xq24-q26
12  THOC2            44  FTSJ1                76  ARX
13  Xp22.3-q22       45  ZNF81                77  Xq12-q21.33
14  Xp21.2-q13       46  ARHGEF6              78  Xp11.4-p11.23
15  Xp22.1-q12       47  PAK3                 79  MECP2
16  MECP2            48  GDI1                 80  Xq22-q24
17  dup HUWE1        49  CLCN4                81  Xp11.2-q12
18  IQSEC2           50  Xp11.3-p11.21        82  Xq24-q25
19  RPSKA3           51  Xp11.23-p11.3        83  Not published
20  Xp21.1-q23       52  Xp11.21-q21.32       84  Xp11.3-q22.3
21  IL1RAPL1         53  Xq22.2-q26           85  Xp21.3-p21.1
22  Xp21.1-q21.31    54  ARX                  86  Not published
23  Xq23-q24         55  PQBP1                87  ARX
24  Xp22.2-p22.3     56  Xp21.1-p11.21        88  AGTR2
25  Xq27.3           57  Xq24-q25             89  ZNF41
26  Xp11.4-q23       58  TM4SF2               90  DLG3
27  Xq24-q27.1       59  AP1S2                91  ZDHHC15
28  Xq27.3-qter      60  OPHN1                92  ZNF674
29  ARX              61  Xq13.1-q25           93  BRWD3
30  PAK3             62  UPF3B                94  GRIA3
31  dup HUWE1        63  ACSL4                95  MAGT1/OSTb
32  ARX              64  dup MECP2

(a) Mutations in NLGN4, CDKL5, KDM5C, FGD1, SLC16A2, ATRX, AFF2, and SLC6A8 have been found in other families with nonsyndromal XLID.


of the location in the ensuing decade34–36 and identifica-

tion of the gene itself in 1991.37–40

In addition, the juxtaposition and timing of the family

study and the population survey permitted us to look for

the marker X in 5,000 individuals and over 30,000 cells

and to conclude tentatively that it was not a common

marker or variant because not even one marker X cell

was observed. Another family with a similar chromosomal

appearance at distal 16q was also ascertained in this same

interval. This was inherited in an autosomal-dominant

manner and not associated with a disease. We were, there-

fore, able to make the preliminary conclusion that such

markers did not necessarily indicate disease but that the

marker X was a significant clinical marker for a Mendelian

disease and hence a new and useful tool.

Observations in the 1970s and 1980s

More complex and folic-acid-enriched media became

popular during the 1970s and presumably made detection

of the fragile X increasingly difficult. Most early studies

gave variable results and were not published. The initial

report was confirmed by Giraud et al.34 and Harvey

et al.35 These articles and the report by Sutherland36 estab-

lished that folic acid in the culture media prevented the

expression and detection of the fragile X.

During the 1980s it became clear that a majority of XLID

families did not have fragile X, and the identification and

study of large non-fragile X XLID families with linkage

analysis began in earnest. Large scale studies began across

the globe at this time. The results summarized in Table 1,

Tables S1 and S2, and Figures 1, 2, and 3 are, therefore,

based on about 20 years of clinical and molecular studies.

Methodologies Quicken the Pace of Gene Discovery

Besides the cytogenetic methods used in the diagnosis

and confirmation of fragile X, a number of strategies

have been utilized to identify XLID genes (Table S1 and

Figures 1 and 2). Prior to 1990, these were limited to the

pursuit of genes in cases where the gene products (enzymes

in all cases: HPRT [MIM 308000], PGK1 [MIM 311800],

OTC [MIM 311250], and PDHA1 [MIM 300582]) were

known, the molecular pathway was known (PLP [MIM

300401]) or a chromosome aberration had localized the

candidate region (DMD [MIM 300377]). Over the next

decade and a half, exploitation of chromosome rearrange-

ments and linkage coupled with candidate gene testing

dominated the field. In the past several years, X chromo-

some sequencing, microarrays (expression and genomic),

and exploration of molecular pathways have added to

the range of technologies available for XLID gene identifi-

cation. Five of the first seven gene identifications were

accomplished with a combination of known metabolic

pathways and tissue culture studies in families with inborn

errors of metabolism (Figure 4). The first identification,

Lesch-Nyhan syndrome due to mutations in HPRT, was

reported in 198341 and the most recent was the creatine

transporter syndrome (MIM 300352) due to mutations in

SLC6A8 [MIM 300036].42 Mutations in seven genes were

identified by this methodology.

Figure 3. Approximate Linkage Limits for XLID Syndromes for which the Genes Have Not Been Identified
[X chromosome ideogram with linkage intervals for the following syndromes: Aicardi; Bertini; Dessay; CMT, Ionasescu variant; Prieto; XLID-blindness-seizures-spasticity; Wieacker-Wolff; Miles-Carpenter; Goldblatt spastic paraplegia; XLID spastic paraplegia, type 7; XLID-macrocephaly-macroorchidism; Abidi; Shrimpton; XLID-telecanthus-deafness; XLID-hypogammaglobulinemia; Ahmad; MRXS7; CMT, Cowchock variant; XLID-panhypopituitarism; Christian; XLID-coarse facies; Vitale: aphasia-coarse facies; Gustavson; Craniofacioskeletal; Hypoparathyroidism, X-linked; Armfield; Waisman-Laxova; Hereditary bullous dystrophy; XLID-microcephaly-testicular failure.]

Two workhorse approaches have been responsible for

the great majority of subsequent gene identifications.

The first of these, based on the ascertainment of a patient

with both ID and a chromosomal rearrangement involving

the X chromosome, was used successfully in identifying

the gene associated with Duchenne muscular dystrophy

in 1987. A total of 31 genes (Table S1 and Figure 4) had

been identified by the middle of 2011 with this approach.

The second and most productive “workhorse” approach,

linkage study of XLID families followed by molecular

analysis of appropriate candidate genes, was employed

initially by a number of investigators in detecting and

characterizing FMR1 (MIM 309550). Subsequently, its use

has resulted in the identification of 43 mutant X genes.

With increasing ease of sequencing, the pace of gene iden-

tification by this route accelerated after 2003, as shown in

Table S1 and Figure 4.

The availability of brute force sequencing capability after

completion of the Human Genome Project has brought an

additional effective method of gene identification, and 21

have been reported since 2006 (Table S1 and Figure 4).

Whether sequencing of large series of sporadic males,

male siblings, or families with clear XLID will prove to be

the most effective use of this resource remains to be deter-

mined. The selection of pedigree-based subjects for

sequencing, however, has the advantage that segregation

of gene alterations can be tested. Since this approach often

permits a relatively straight-forward path to gene identifi-

cation, continued collection of both clinical data and

blood samples remains important. Exploitation of a specific

molecular finding has accounted for four gene identifica-

tions (FANCB [MIM 300515], PORCN [MIM 300651],

SMC1A/SM1L1 [MIM 300040], NDUFA1 [MIM 300078]).

Two other new technologies, expression array and array-

comparative genomic hybridization have, surprisingly,

been applied successfully in only two and one instance,

respectively. Expression array was used in combination

with two other methods to discover the role of GRIA3

(MIM 305915) and PTCHD1 (MIM 300828) in ID. Array-

CGH was used in the isolation of the mutant gene in one

nonsyndromal family (HUWE1 [MIM 300697]).43 Many

potentially valuable combinations of array technologies

for screening followed with brute force sequencing can

Figure 4. The Year and Methodology Used to Identify Genes Associated with XLID
The following abbreviations are used: Exp-Arr = expression microarray. MCGH = genomic microarray. X-seq = gene sequencing. Mol-Fu = follow-up of a known molecular pathway. L-can = candidate gene testing within a linkage interval. Chr-rea = positional cloning based on a chromosome rearrangement. Met-Fu = follow-up of a known metabolic pathway.


be envisioned. Detection of a consistent up- or

downregulation or other abnormality in two or more XLID family

members can certainly be envisioned as a fruitful approach

to the selection of subjects for partial or complete X

sequencing. Two or more approaches were used in combi-

nation in six instances among the 102 gene identifications

shown in Table S1 and Figure 1 (FMR1, MID1 [MIM

602148], SOX3 [MIM 313430], HUWE1, CASK [MIM

300172], and GRIA3). The application of CGH and related

methods in conjunction with a variety of molecular

technologies has increasingly been used to detect du-

plications and deletions of genes associated with XLID

(Figure 5).1,43–56

In spite of the identification of mutations in 102 genes

that result in XLID, the fragile X syndrome continues to

be by far the most frequent XLID syndrome. Whether

the gradual but continuous expansion of the number of

triplet repeats in the large bank of premutation carriers

(whose frequency varies from 1/113 in Israel to 1/313–382 in the United

States) plays a role in maintaining its relatively high gene

frequency is unknown.57

Lumping, Splitting, and Reclassification Based on

Gene Discovery: A Model for Future Research

Given the variability and imprecision with which clinical

evaluations are carried out, it is inevitable that some indi-

viduals with X-linked ID will be incorrectly included in

existing diagnostic categories, whereas others will be incor-

rectly excluded. The extent to which individuals and

families can be evaluated is dependent on the setting,

access to historical information, availability and ages of

affected and nonaffected family members, and the ex-

perience and expertise of the observers. Differences in

phenotype can result from mutations in different domains

of a gene and from contributions from the balance of the

genome. The identification of mutations in many genes

associated with XLID has provided the opportunity to

compensate for some of these variables, resulting in the

lumping of entities previously considered to be separate

and the splitting of other entities previously considered

the same. In addition, the phenotypic limits of some

XLID entities were established with some degree of

objectivity.

Several XLID entities have been most instructive. Dis-

covery that mutations in ATRX (MIM 300032) (Xq21.1)

cause alpha-thalassemia ID allowed testing of a large

number of males with hypotonic facies, ID, and other

features.58–60 Currently, as shown in Table S1, four other

named XLID syndromes (Carpenter-Waziri, Holmes-

Gang, XLID-Hypotonia-Arch Fingerprints, and Chudley-

Lowry syndromes [MIM 309580]) have been found to be

allelic variants of alpha-thalassemia ID as have certain

families with spastic paraplegia and nonsyndromal

XLID.1,61–65 One family clinically diagnosed as Juberg-

Marsidi syndrome was found to have an ATRX muta-

tion.66,67 This is now known to be based on misdiagnosis

of Juberg-Marsidi syndrome (MIM 300612); indeed, the

original family with this syndrome has a mutation in

HUWE1 at Xp11.22 (Friez et al., 2011, 15th International

Workshop on Fragile X and Other Early-Onset Cognitive

Disorders). One family clinically diagnosed as Smith-

Fineman-Myers syndrome was also found to harbor an

ATRX mutation, but the gene has not been analyzed in

the original family.68–70 A clinically similar condition,

Coffin-Lowry syndrome (MIM 303600), was found to be

separate from alpha-thalassemia ID and due to mutations

in RPS6KA3 (MIM 300075), which encodes a serine-threo-

nine kinase.71

Kalscheuer et al.72 found mutations in PQBP1 (MIM

300463) (Xp11.2) in two named XLID syndromes,

Sutherland-Haan syndrome (MIM 309470) and Hamel

cerebropalatocardiac syndrome (MIM 309500), as well as in MRX55 and

two other families with microcephaly and other findings.

Lenski et al.,73 Stevenson et al.,74 and Lubs et al.75 added

Renpenning, Porteous, and Golabi-Ito-Hall syndromes to

the list of XLID syndromes caused by mutations in

PQBP1.73–75 The six phenotypes attributed to mutations

in PQBP1 are now summarized in the allelic variants

of OMIM 300463. As with the ATRX phenotypes, a wide

variety of phenotypic expressions result from different

mutations in PQBP1 and we remain challenged to better

understand the molecular and developmental mecha-

nisms leading to these differences.

Figure 5. Location of Segmental Duplications Associated with Syndromal or Nonsyndromal XLID43–56
[X chromosome ideogram with duplications labeled by report: Wagenstaller et al.54; Horn et al.50; Gijsbers et al.49; Whibley et al.55; Froyen et al.44; Froyen et al.45; Bedeschi et al.48; Koolen et al.46; Mimault et al.51; Woodward et al.56; Solomon et al.53; Van Esch et al.47; Friez et al.43; Rio et al.52.]

Mutations in ARX (MIM 300382) (Xp22.2) were also

found to be an important cause of XLID encompassing

multiple phenotypes. Alterations, most commonly a 24 bp

expansion of a polyalanine tract, were found in a number

of families with nonsyndromal XLID (MRX29, 32, 33, 36,

38, 43, 54, and 76), an X-linked dystonia (Partington

syndrome [MIM 309510]), X-linked infantile spasms

(MIM 308350) (West syndrome), X-linked lissencephaly

with abnormal genitalia (MIM 300215), hydranencephaly

and abnormal genitalia (MIM 300215), and Proud

syndrome (MIM 300215).76–83

Perhaps the most prominent example of syndrome split-

ting is FG syndrome (MIM 305450). This syndrome,

initially described in 1974 by Opitz and Kaveggia,84 is

manifested by macrocephaly (or relative macrocephaly),

downslanting palpebral fissures, imperforate anus or

severe constipation, broad and flat thumbs and great

toes, hypotonia, and ID. In the ensuing years, the manifes-

tations attributed to FG syndrome have become protean,

but none was pathognomonic or required for the

diagnosis.85–88 As a result, a number of different localiza-

tions on the X chromosome were proposed for FG

syndrome.89–95

In 2007, Risheg et al.96 found a recurring mutation,

c.2881C>T (p.Arg961Trp), in MED12 (MIM 300188) in

six families with the FG phenotype, including the original

family reported by Opitz and Kaveggia.84 In addition to the

above noted manifestations, two other findings, small ears

and friendly behavior, were consistently noted.

Although most individuals who have carried the FG

diagnosis have one or more findings that overlap with

those in FG syndrome, they do not have MED12 muta-

tions.97,98 Some have been found to have mutations in

other X-linked genes (FMR1, FLNA [MIM 300017], ATRX,

CASK, and MECP2 [MIM 300005]), whereas others have

duplications or deletions of the autosomes.97 So great is

the currently existing heterogeneity within FG syndrome

that the vast majority of individuals so designated should

best be considered to have ID of undetermined cause.

In a number of instances, certain gene mutations have

been associated with nonsyndromal XLID, whereas other

mutations within the same genes have caused syndromal

XLID. Mutations in 17 genes that may cause either type

of XLID, depending on the mutation, have been identified

(Figure 2). In some cases (e.g., those with OPHN1 [MIM

300127] and ARX mutations) re-examination has found

syndromal manifestations in families previously consid-

ered to have nonsyndromal XLID.79,99,100

The frequency with which the process of lumping and

splitting in this limited field of investigation has occurred

has been extremely instructive to both clinical and molec-

ular investigators. Moreover, the process of reclassifying

and refining the XLID syndromes in light of the gene

identifications may be one of the most important contributions

by medical genetics to clinical medicine. The underlying

mechanisms or pathways by which mutations in different

genes result in similar phenotypes and different mutations

in a single gene result in disparate phenotypes, however,

remain to be fully elucidated.

Improved Understanding of Disease Mechanisms

in XLID Disorders

Analysis of the presently known 102 genes associated with

XLID lends some insight into the numerous molecular

functions in which disruption can lead to cognitive

impairment and impaired brain development.17 Three

major functions are almost equally represented in proteins

encoded by this panel of 102 genes: 22% are involved in

regulation of transcription, 19% in signal transduction,

and 15% in metabolism. Additionally, 15% are compo-

nents of membrane-associated functions. The remainder

are equally distributed (~3%–5%) in seven other cellular

functions: cytoskeleton, RNA processing, DNA metabo-

lism, protein synthesis, ubiquitinization, cell cycle, and

cell adhesion. Regarding their localization within a cell,

the proteins encoded by genes associated with XLID are

almost equally distributed among the four major subcel-

lular fractions: 30% in the nucleus, 28% in the cytoplasm,

18% in the membranes, and 16% in cellular organelles.17
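
As a rough cross-check of the proportions quoted above, the following Python sketch converts the reported percentages into approximate gene counts out of 102 and computes the share left for the seven smaller functional categories. The percentages are those given in the text; the variable and function names are illustrative only, and the rounded counts are approximations, not figures reported by the authors.

```python
# Percentages as quoted in the text for the 102 XLID genes.
TOTAL_GENES = 102

major_functions = {
    "regulation of transcription": 22,
    "signal transduction": 19,
    "metabolism": 15,
    "membrane-associated functions": 15,
}

localization = {
    "nucleus": 30,
    "cytoplasm": 28,
    "membranes": 18,
    "cellular organelles": 16,
}


def approx_count(percent, total=TOTAL_GENES):
    """Approximate number of genes implied by a rounded percentage."""
    return round(percent * total / 100)


# Share covered by the four named functions and the share left over.
named_share = sum(major_functions.values())
remaining_share = 100 - named_share

# The text spreads the remainder over seven other cellular functions.
mean_share_other = remaining_share / 7

for name, pct in major_functions.items():
    print(f"{name}: {pct}% (about {approx_count(pct)} genes)")
print(f"remaining functions: {remaining_share}% "
      f"(about {mean_share_other:.1f}% each)")
print(f"localization accounted for: {sum(localization.values())}%")
```

The four named functions account for 71% of the panel, which leaves 29% for the seven remaining functions (about 4% each, in line with the stated 3%–5%). The four subcellular fractions sum to 92%; the text does not say where the remaining 8% falls.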

The XLID disorders offer many opportunities for under-

standing the functions of specific genes and their interac-

tions with other genes in producing disease. Studies

involving control of gene expression will necessarily be

especially complex. These have just begun, in part because

of their complexity and the rapid development of new

techniques. Only recently, for example, has a preliminary

expression microarray analysis been carried out in two affected

fragile X males.101 The study identified over 90 genes with

a greater than 1.5-fold change in expression. Overrepre-

sented genes were involved in signaling (both under-

and overexpression), morphogenesis (underexpression),

and neurodevelopment and function (overexpression).

Although not addressed in this study, the possibility that

a hallmark finding in the fragile X syndrome, enlargement

of the testes, might result from altered control of tubular

growth by a specific target gene is intriguing. One of the

90 genes identified, NUT (nuclear protein in testis [MIM

608963]), which is normally only expressed in the testis,

should be a candidate gene in future studies because the

BRD4-NUT fusion oncogenes are critical growth promoters

in certain aggressive carcinomas.102 Alternatively, a more

general growth-controlling gene might also explain the

prognathism, macrocephaly and large hands which occur

in some individuals with the fragile X syndrome.

Studies directed at understanding the mechanisms

underlying recurring clinical problems in XLID disorders

such as short stature, microcephaly or macrocephaly,

autistic behavior, and structural CNS abnormalities103

are also particularly appealing because they provide an

opportunity to understand simultaneously critical

pathways (such as those in dendrite development and the

development of XLID structural abnormalities), gene expression,

and phenotype. The association of autism spectrum

disorder with mutations in at least eight of the 102 genes

listed in Table S1 is of particular current interest. This has

been reported most frequently in the fragile X syndrome

and Rett syndrome but also in disorders resulting from


mutations in NLGN3 (MIM 300336), NLGN4 (MIM

300427), RPL10 (MIM 312173), RAB39B (MIM 300774),

PTCHD1, and MED12. These genes, however, affect a wide

range of functions (Table S1), and the cause of the clinical

overlap is not clear. In nonsyndromal XLID, for example,

mutations have been identified in five genes involved in

the RhoGTPase cycle that affect dendritic outgrowth

(OPHN1, PAK3 [MIM 300142], ARHGEF6 [MIM 300267],

TM4SF2 [MIM 300096], and GDI1 [MIM 300104]) and are

central to the development of the nonsyndromal pheno-

type.1,17,104

The limited imaging and direct studies of macrocephaly,

microcephaly, and cerebellar hypoplasia have recently

been summarized,104 but more extensive application of

anatomical and functional brain imaging and spectros-

copy techniques that can identify variations in specific

brain regions for each disorder, in conjunction with both

clinical observations and psychometric studies, is critically

needed.

Detection of Possible Advantageous Cognitive

and Behavioral Genes

The identification of 102 X-linked genes affecting intelli-

gence has raised the probability that X chromosomal genes

(including XLID genes) might play a particularly impor-

tant role in brain structure and function as well as a specific

role in intelligence and certain cognitive abilities. Clearly,

as discussed at the beginning of this paper, the research

planned and carried out to identify XLID genes and

syndromes over the last several decades might account

for part or even all of this relative excess compared to auto-

somal loci. A number of papers, however, have addressed

the issue of active selection during evolution for X chro-

mosomal localization of important brain and cognitive

genes.2,105,106 The finding that human and mouse X chro-

mosome genes are hyperexpressed in the CNS compared to

autosomal genes provided additional important confirma-

tory data for the hypothesis of positive evolutionary selec-

tion.107 These studies showed not only that there was a

doubling of X chromosome expression (compared to

autosomes) early in development (leading to dosage

compensation) but also that expression in human CNS tissue and in

mouse CNS tissue was increased 2.8× and 2.5×,

respectively, compared to expression in somatic tissues. These

observations also support the general idea that X genes

are particularly important for brain development and

function. Mutations significantly improving intellectual,

creative, perceptive, and leadership qualities would be

fully expressed in males and reasonably could have been

positively selected for in a relatively short period of time

in contrast to the negative selection for XLID muta-

tions.108–112 In essence, the XY males may have been the

experimental animal and the XX female, the storage

facility for both advantageous and deleterious mutations.

Medical investigations generally focus on adverse effects

and no organized searches for X-linked pedigrees with

particularly high intellectual or special cognitive talents

have been reported. Thus, the same approach that has

been effective in identifying XLID syndrome genes,

investigating families with an X-linked pattern of intellectual

outliers, might also prove rewarding for studies at the other

end of the intellectual spectrum. What if we selected for

families with an X-linked pattern of high intellectual

accomplishment; special talents in art or music; unique

types of cognitive behavior involving memory, problem

solving, or, indeed, any type of special intellectual accom-

plishment such as Nobel awards in Economics or Physics?

Such families will certainly be uncommon but so are most

XLID disorders. Yet families might be identified if academi-

cians asked the pertinent family history questions during

lunch with colleagues, a dedicated, interactive home page

was available, or notices were placed in journals asking for

information about possible families. The same group of

laboratories that contributed to the data in Table S1 would

be logical sources for referral and molecular studies because

the necessary cognitive and molecular studies are already

in place. A positive result might even be more important

to society than XLID disease description and provide

important insight into human evolution.

Although there is a wide array of pertinent cognitive

tests, these were not designed to detect specific familial

talents. The coapplication of a pedigree analysis with perti-

nent laboratory tests should provide sufficiently precise

initial diagnosis of the affected to carry out linkage and

array or other screening tests successfully. One family

with four to five outstanding individuals over several

generations could provide sufficient data to warrant testing

other families (or even other species) and to begin an iden-

tification process similar to that described in this paper

that has proven successful for XLID. Imagine the prospects

for investigating specific gene-environmental interactions

during learning and development!

Why, other than not having looked seriously, have

we not stumbled upon such families? Perhaps we have.

In the Inaugural Book of the new National Museum

of the American Indian, Native Universe, Voices of Indian

America,113 in which tribal leaders, writers, scholars, and

story tellers describe Indian traditions and heritages, the

following is recounted:

“Story tells us that a group split from the Lenni Lenape,

perhaps a thousand years ago or more. The people then

settled on the Eastern Shore of the Chesapeake, and were

one and the same as the Nanticoke. Then, for some reason,

the first Tayac, Uttapoingassenum, led his people to the

other side of the bay. Upon their arrival, they encountered

peoples who had been living on the land for more than

8,000 years, according to various archeological estimates.

For thirteen generations prior to English settlement, as

told to Jesuit and Moravian missionaries, the Tayac’s

inheritance passed from brother to brother and then to the

sister’s sons. Each led the people until his death.”

The possibility that the Nanticoke had intuitively recog-

nized and employed a quality of leadership that followed

an X-linked pattern of inheritance is intriguing to consider.
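
The genetic reasoning behind this speculation can be illustrated with a short calculation. The Python sketch below is purely illustrative: it assumes a hypothetical advantageous allele on the leader's single X chromosome, a heterozygous mother, a father and spouses who do not carry the allele, and no new mutations. None of these assumptions comes from the historical account, and the function and key names are invented for the example.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def carrier_probabilities():
    """Chance that each relative carries the male leader's X-linked allele.

    Illustrative assumptions: the leader inherited the allele from a
    heterozygous mother; his father and all spouses lack it; no new
    mutations arise.
    """
    # A heterozygous mother passes the allele-bearing X to half her children.
    brother = HALF
    sister = HALF
    # A carrier sister passes it on to half of her sons.
    sisters_son = sister * HALF
    # A father gives his Y, never his X, to a son.
    own_son = Fraction(0)
    # A father gives his only X to every daughter.
    own_daughter = Fraction(1)
    daughters_son = own_daughter * HALF
    return {
        "brother": brother,
        "sister's son": sisters_son,
        "own son": own_son,
        "own daughter": own_daughter,
        "daughter's son": daughters_son,
    }


probs = carrier_probabilities()
for relative, p in probs.items():
    print(f"{relative}: {p}")
```

Under these assumptions, a brother carries the allele with probability 1/2 and a sister's son with probability 1/4, whereas the leader's own son cannot inherit it. This is compatible with the brother-to-brother and then sister's-sons order of succession described in the quotation.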


Although much progress has been made during the past

four decades, the clinical and molecular delineation of

XLID is far from complete. Perhaps little more than half

of the genes in which mutations will result in XLID have

been identified. The molecular pathways are incompletely

understood, the mechanisms by which brain structure and

function are deranged have not been identified, and with

few exceptions the neurobehavioral profiles and natural

history of the XLID entities have received insufficient

attention. These deficiencies notwithstanding, consider-

able benefits have been gained for individuals with XLID

and their families. Specific molecular tests, including mul-

tigene panels, are now available to more efficiently reach

a diagnosis. Carrier testing, donor eggs, prenatal diagnosis,

and preimplantation genetic testing may be used to

prevent recurrence when a specific gene mutation is found.

Through these measures, reproductive confidence may be

restored for families in which XLID has occurred.114

Supplemental Data

Supplemental Data include two tables and can be found with this

article online at http://www.cell.com/AJHG/.

Web Resources

The URLs for data presented herein are as follows:

Greenwood Genetic Center, XLID Update, http://www.ggc.org/research/molecular-studies/xlid.html

Online Mendelian Inheritance in Man (OMIM), http://www.omim.org/

References

1. Stevenson, R.E., Schwartz, C.E., and Rogers, R.C. (2012).

Atlas of X-Linked Intellectual Disability Syndromes (New

York: Oxford University Press).

2. Skuse, D.H. (2005). X-linked genes and mental functioning.

Hum. Mol. Genet. 14 (Spec No 1), R27–R32.

3. Gecz, J., Shoubridge, C., and Corbett, M. (2009). The genetic

landscape of intellectual disability arising from chromo-

some X. Trends Genet. 25, 308–316.

4. Lubs, H.A. (1969). A marker X chromosome. Am. J. Hum.

Genet. 21, 231–244.

5. Kaiser-McCaw, B., Hecht, F., Cadien, J.D., and Moore, B.C.

(1980). Fragile X-linked mental retardation. Am. J. Med.

Genet. 7, 503–505.

6. Opitz, J.M., and Sutherland, G.R. (1984). Conference report:

International workshop on the fragile X and X-linked intel-

lectual disability. Am. J. Med. Genet. 17, 5–94.

7. Turner, G., Opitz, J.M., Brown, W.T., Davies, K.E., Jacobs,

P.A., Jenkins, E.C., Mikkelson, M., Partington, M.W., and

Sutherland, G.R. (1986). Conference report: Second interna-

tional workshop on the fragile X and on X-linked mental

retardation. Am. J. Med. Genet. 23, 11–67.

8. Neri, G., Opitz, J.M., Mikkelson, M., Jacobs, P.A., Davies, K.,

and Turner, G. (1988). Conference report: Third interna-

tional workshop on the fragile X and X-linked mental

retardation. Am. J. Med. Genet. 30, 1–29.

9. Neri, G., Gurrieri, F., Gal, A., and Lubs, H.A. (1991). XLMR

genes: Update 1990. Am. J. Med. Genet. 38, 186–189.

10. Neri, G., Chiurazzi, P., Arena, F., Lubs, H.A., and Glass, I.A. (1992).

XLMR genes: Update 1992. Am. J. Med. Genet. 43, 373–382.

11. Neri, G., Chiurazzi, P., Arena, J.F., and Lubs, H.A. (1994).

XLMR genes: Update 1994. Am. J. Med. Genet. 51, 542–549.

12. Brown, W.T., Jenkins, E., Neri, G., Lubs, H., Shapiro, L.R.,

Davies, K.E., Sherman, S., Hagerman, R., and Laird, C.

(1991). Conference report: Fourth international workshop

on the fragile X and X-linked mental retardation. Am. J.

Med. Genet. 38, 158–172.

13. Lubs, H.A., Chiurazzi, P., Arena, J.F., Schwartz, C., Traneb-

jaerg, L., and Neri, G. (1996). XLMR genes: update 1996.

Am. J. Med. Genet. 64, 147–157.

14. Lubs, H., Chiurazzi, P., Arena, J., Schwartz, C., Tranebjaerg,

L., and Neri, G. (1999). XLMR genes: Update 1998. Am. J.

Med. Genet. 83, 237–247.

15. Chiurazzi, P., Hamel, B.C., and Neri, G. (2001). XLMR genes:

Update 2000. Eur. J. Hum. Genet. 9, 71–81.

16. Chiurazzi, P., Schwartz, C.E., Gecz, J., and Neri, G. (2008).

XLMR genes: Update 2007. Eur. J. Hum. Genet. 16, 422–434.

17. Ropers, H.H. (2008). Genetics of intellectual disability. Curr.

Opin. Genet. Dev. 18, 241–250.

18. Chelly, J., Khelfaoui, M., Francis, F., Cherif, B., and Bienvenu,

T. (2006). Genetics and pathophysiology of mental retarda-

tion. Eur. J. Hum. Genet. 14, 701–713.

19. Ropers, H.H., and Hamel, B.C. (2005). X-linked mental

retardation. Nat. Rev. Genet. 6, 46–57.

20. Kleefstra, T., and Hamel, B.C. (2006). X-linked mental retar-

dation: Further lumping, splitting and emerging pheno-

types. Clin. Genet. 67, 451–467.

21. Stevenson, R.E., and Schwartz, C.E. (2002). Clinical and

molecular contributions to the understanding of X-linked

mental retardation. Cytogenet. Genome Res. 99, 265–275.

22. Neri, G., and Opitz, J.M. (2000). Sixty years of X-linked

mental retardation: A historical footnote. Am. J. Med. Genet.

97, 228–233.

23. Martin, J.P., and Bell, J. (1943). A pedigree of mental defect

showing sex-linkage. J. Neurol. Psychiatry 6, 154–157.

24. Allan, W., Herndon, C.N., and Dudley, F.C. (1944). Some

examples of the inheritance of mental deficiency: Apparently

sex-linked idiocy and microcephaly. Am. J. Ment. Defic. 48,

325–334.

25. Bickers, D.S., and Adams, R.D. (1949). Hereditary stenosis of

the aqueduct of Sylvius as a cause of congenital hydroceph-

alus. Brain 72, 246–262.

26. Losowsky, M.S. (1961). Hereditary mental defect showing the

pattern of sex influence. J. Ment. Defic. Res. 5, 60–62.

27. Renpenning, H., Gerrard, J.W., Zaleski, W.A., and Tabata, T.

(1962). Familial sex-linked mental retardation. Can. Med.

Assoc. J. 87, 954–956.

28. Dunn, H.G., Renpenning, H., Gerrard, H.W., Miller, J.R.,

Tabata, T., and Federoff, S. (1963). Mental retardation as

a sex-linked defect. Am. J. Ment. Defic. 67, 827–848.

29. Penrose, L.S. (1938). A clinical and genetic study of 1280 cases

of mental defect. Special Report Series, Medical Research

Council, No. 229 (London: His Majesty’s Stationery Office).

30. Lehrke, R.G. (1974). X-linked mental retardation and verbal

disability. Birth Defects Orig. Artic. Ser. 10, 1–100.

31. Herbst, D.S., and Miller, J.R. (1980). Nonspecific X-linked

mental retardation II: The frequency in British Columbia.

Am. J. Med. Genet. 7, 461–469.


32. Lubs, H.A., and Ruddle, F.H. (1970). Chromosomal abnor-

malities in the human population: estimation of rates based

on New Haven newborn study. Science 169, 495–497.

33. Harrison, C.J., Jack, E.M., Allen, T.D., and Harris, R. (1983).

The fragile X: A scanning electron microscope study. J.

Med. Genet. 20, 280–285.

34. Giraud, F., Ayme, S., Mattei, J.F., and Mattei, M.G. (1976).

Constitutional chromosomal breakage. Hum. Genet. 34,

125–136.

35. Harvey, J., Judge, C., and Wiener, S. (1977). Familial X-linked

mental retardation with an X chromosome abnormality. J.

Med. Genet. 14, 46–50.

36. Sutherland, G.R. (1977). Fragile sites on human chromo-

somes: Demonstration of their dependence on the type of

tissue culture medium. Science 197, 265–266.

37. Oberle, I., Rousseau, F., Heitz, D., Kretz, C., Kevys, D., Hana-

uer, A., Boue, J., Bertheas, M.F., and Mandel, J.L. (1991).

Instability of a 550-base pair DNA segment and abnormal

methylation in fragile X syndrome. Science 252, 1097–1102.

38. Bell, M.V., Hirst, M.C., Nakahori, Y., MacKinnon, R.N.,

Roche, A., Flint, T.J., Jacobs, P.A., Tommerup, N., Tranebjaerg,

L., Froster-Iskenius, U., et al. (1991). Physical mapping across

the fragile X: hypermethylation and clinical expression of

the fragile X syndrome. Cell 64, 861–866.

39. Yu, S., Pritchard, M., Kremer, E., Lynch, M., Nancarrow, J.,

Baker, E., Holman, K., Mulley, J., Warren, S., Schlessinger,

D., et al. (1991). Fragile X genotype characterized by an

unstable region of DNA. Science 252, 1179–1181.

40. Verkerk, A.J., Pieretti, M., Sutcliffe, J.S., Fu, Y.H., Kuhl, D.P.,

Pizzuti, A., Reiner, O., Richards, S., Victoria, M.F., Zhang,

F.P., et al. (1991). Identification of a gene (FMR-1) containing

a CGG repeat coincident with a breakpoint cluster region

exhibiting length variation in fragile X syndrome. Cell 65,

905–914.

41. Jolly, D.J., Okayama, H., Berg, P., Esty, A.C., Filpula, D.,

Bohlen, P., Johnson, G.G., Shively, J.E., Hunkapillar, T., and

Friedmann, T. (1983). Isolation and characterization of

a full-length expressible cDNA for human hypoxanthine

phosphoribosyl transferase. Proc. Natl. Acad. Sci. USA 80,

477–481.

42. Salomons, G.S., van Dooren, S.J., Verhoeven, N.M., Cecil,

K.M., Ball, W.S., Degrauw, T.J., and Jakobs, C. (2001).

X-linked creatine-transporter gene (SLC6A8) defect: A new

creatine-deficiency syndrome. Am. J. Hum. Genet. 68,

1497–1500.

43. Friez, M.J., Jones, J.R., Clarkson, K., Lubs, H., Abuelo, D., Bier,

J.A., Pai, S., Simensen, R., Williams, C., Giampietro, P.F., et al.

(2006). Recurrent infections, hypotonia, and mental retarda-

tion caused by duplication of MECP2 and adjacent region in

Xq28. Pediatrics 118, e1687–e1695.

44. Froyen, G., Van Esch, H., Bauters, M., Hollanders, K., Frints,

S.G., Vermeesch, J.R., Devriendt, K., Fryns, J.P., and Marynen,

P. (2007). Detection of genomic copy number changes in

patients with idiopathic mental retardation by high-resolu-

tion X-array-CGH: Important role for increased gene dosage

of XLMR genes. Hum. Mutat. 28, 1034–1042.

45. Froyen, G., Corbett, M., Vandewalle, J., Jarvela, I., Lawrence,

O., Meldrum, C., Bauters, M., Govaerts, K., Vandeleur, L., Van

Esch, H., et al. (2008). Submicroscopic duplications of the

hydroxysteroid dehydrogenase HSD17B10 and the E3 ubiq-

uitin ligase HUWE1 are associated with mental retardation.

Am. J. Hum. Genet. 82, 432–443.

46. Koolen, D.A., Pfundt, R., de Leeuw, N., Hehir-Kwa, J.Y., Nille-

sen, W.M., Neefs, I., Scheltinga, I., Sistermans, E., Smeets, D.,

Brunner, H.G., et al. (2009). Genomic microarrays in mental

retardation: A practical workflow for diagnostic applications.

Hum. Mutat. 30, 283–292.

47. Van Esch, H., Bauters, M., Ignatius, J., Jansen, M., Raynaud, M.,

Hollanders, K., Lugtenberg, D., Bienvenu, T., Jensen, L.R., Gecz,

J., et al. (2005). Duplication of the MECP2 region is a frequent

cause of severe mental retardation and progressive neurological

symptoms in males. Am. J. Hum. Genet. 77, 442–453.

48. Bedeschi, M.F., Novelli, A., Bernardini, L., Parazzini, C.,

Bianchi, V., Torres, B., Natacci, F., Giuffrida, M.G., Ficarazzi,

P., Dallapiccola, B., and Lalatta, F. (2008). Association of syn-

dromic mental retardation with an Xq12q13.1 duplication

encompassing the oligophrenin 1 gene. Am. J. Med. Genet.

A. 146A, 1718–1724.

49. Gijsbers, A.C., den Hollander, N.S., Helderman-van de Enden,

A.T., Schuurs-Hoeijmakers, J.H., Vijfhuizen, L., Bijlsma, E.K.,

van Haeringen, A., Hansson, K.B., Bakker, E., Breuning,

M.H., and Ruivenkamp, C.A. (2011). X-chromosome duplications

in males with mental retardation: Pathogenic or benign

variants? Clin. Genet. 79, 71–78.

50. Horn, D., Spranger, S., Kruger, G., Wagenstaller, J., Weschke,

B., Ropers, H.H., Mundlos, S., Ullmann, R., Strom, T.M., and

Kiopocki, E. (2007). Microdeletions and microduplications

affecting the STS gene at Xp22.31 are associated with a

distinct phenotypic spectrum. Medizinische Genetik 19, 62.

51. Mimault, C., Giraud, G., Courtois, V., Cailloux, F., Boire, J.Y.,

Dastugue, B., and Boespflug-Tanguy, O.; The Clinical Euro-

pean Network on Brain Dysmyelinating Disease. (1999).

Proteolipoprotein gene analysis in 82 patients with sporadic

Pelizaeus-Merzbacher Disease: Duplications, the major cause

of the disease, originate more frequently in male germ cells,

but point mutations do not. Am. J. Hum. Genet. 65,

360–369.

52. Rio, M., Malan, V., Boissel, S., Toutain, A., Royer, G., Gobin,

S., Morichon-Delvallez, N., Turleau, C., Bonnefont, J.P.,

Munnich, A., et al. (2010). Familial interstitial Xq27.3q28

duplication encompassing the FMR1 gene but not the

MECP2 gene causes a new syndromic mental retardation

condition. Eur. J. Hum. Genet. 18, 285–290.

53. Solomon, N.M., Ross, S.A., Morgan, T., Belsky, J.L., Hol, F.A.,

Karnes, P.S., Hopwood, N.J., Myers, S.E., Tan, A.S., Warne,

G.L., et al. (2004). Array comparative genomic hybridisation

analysis of boys with X linked hypopituitarism identifies

a 3.9 Mb duplicated critical region at Xq27 containing

SOX3. J. Med. Genet. 41, 669–678.

54. Wagenstaller, J., Spranger, S., Lorenz-Depiereux, B., Kaz-

mierczak, B., Nathrath, M., Wahl, D., Heye, B., Glaser, D.,

Liebscher, V., Meitinger, T., and Strom, T.M. (2007).

Copy-number variations measured by single-nucleotide-

polymorphism oligonucleotide arrays in patients with

mental retardation. Am. J. Hum. Genet. 81, 768–779.

55. Whibley, A.C., Plagnol, V., Tarpey, P.S., Abidi, F., Fullston, T.,

Choma, M.K., Boucher, C.A., Shepherd, L., Willatt, L.,

Parkin, G., et al. (2010). Fine-scale survey of X chromosome

copy number variants and indels underlying intellectual

disability. Am. J. Hum. Genet. 87, 173–188.

56. Woodward, K., Palmer, R., Rao, K., and Malcolm, S. (1999).

Prenatal diagnosis by FISH in a family with Pelizaeus-

Merzbacher disease caused by duplication of PLP gene.

Prenat. Diagn. 19, 266–268.


57. Hantash, F.M., Goos, D.G., Tsao, D., Quan, F., Buller-Burckle, A.,

Peng, M., Jarvis, M., Sun, W., and Strom, C.M. (2010). Qualitative

assessment of FMR1 (CGG)n triplet repeat status in normal,

intermediate, premutation, full mutation, and mosaic carriers

in both sexes: Implications for fragile X syndrome carrier and

newborn screening. Genet. Med. 12, 162–173.

58. Gibbons, R.J., Brueton, L., Buckle, V.J., Burn, J., Clayton-

Smith, J., Davison, B.C., Gardner, R.J., Homfray, T., Kearney,

L., Kingston, H.M., et al. (1995a). Clinical and hematologic

aspects of the X-linked alpha-thalassemia/mental retardation

syndrome (ATR-X). Am. J. Med. Genet. 55, 288–299.

59. Gibbons, R.J., Picketts, D.J., Villard, L., and Higgs, D.R.

(1995b). Mutations in a putative global transcriptional

regulator cause X-linked mental retardation with alpha-

thalassemia (ATR-X syndrome). Cell 80, 837–845.

60. Villard, L., Bonino, M.C., Abidi, F., Ragusa, A., Belougne, J., Lossi,

A.M., Seaver, L., Bonnefont, J.P., Romano, C., Fichera, M., et al.

(1999). Evaluation of a mutation screening strategy for sporadic

cases of ATR-X syndrome. J. Med. Genet. 36, 183–186.

61. Abidi, F., Schwartz, C.E., Carpenter, N.J., Villard, L., Fontes,

M., and Curtis, M. (1999). Carpenter-Waziri syndrome results

from a mutation in XNP. Am. J. Med. Genet. 85, 249–251.

62. Lossi, A.M., Millan, J.M., Villard, L., Orellana, C., Cardoso,

C., Prieto, F., Fontes, M., and Martínez, F. (1999). Mutation

of the XNP/ATR-X gene in a family with severe mental

retardation, spastic paraplegia and skewed pattern of X inac-

tivation: Demonstration that the mutation is involved in the

inactivation bias. Am. J. Hum. Genet. 65, 558–562.

63. Abidi, F.E., Cardoso, C., Lossi, A.M., Lowry, R.B., Depetris, D.,

Mattei, M.G., Lubs, H.A., Stevenson, R.E., Fontes, M.,

Chudley, A.E., and Schwartz, C.E. (2005). Mutation in the

5′ alternatively spliced region of the XNP/ATR-X gene causes

Chudley-Lowry syndrome. Eur. J. Hum. Genet. 13, 176–183.

64. Guerrini, R., Shanahan, J.L., Carrozzo, R., Bonanni, P., Higgs,

D.R., and Gibbons, R.J. (2000). A nonsense mutation of the

ATRX gene causing mild mental retardation and epilepsy.

Ann. Neurol. 47, 117–121.

65. Yntema, H.G., Poppelaars, F.A., Derksen, E., Oudakker, A.R.,

van Roosmalen, T., Jacobs, A., Obbema, H., Brunner, H.G.,

Hamel, B.C., and van Bokhoven, H. (2002). Expanding

phenotype of XNP mutations: Mild to moderate mental

retardation. Am. J. Med. Genet. 110, 243–247.

66. Mattei, J.F., Collignon, P., Ayme, S., and Giraud, F. (1983).

X-linked mental retardation, growth retardation, deafness

and microgenitalism. A second familial report. Clin. Genet.

23, 70–74.

67. Villard, L., Gecz, J., Mattei, J.F., Fontes, M., Saugier-Veber, P.,

Munnich, A., and Lyonnet, S. (1996). XNP mutation in a

large family with Juberg-Marsidi syndrome. Nat. Genet. 12,

359–360.

68. Smith, R.D., Fineman, R.M., and Myers, G.G. (1980). Short

stature, psychomotor retardation, and unusual facial appear-

ance in two brothers. Am. J. Med. Genet. 7, 5–9.

69. Ades, L.C., Kerr, B., Turner, G., and Wise, G. (1991). Smith-

Fineman-Myers syndrome in two brothers. Am. J. Med.

Genet. 40, 467–470.

70. Villard, L., Fontes, M., Ades, L.C., and Gecz, J. (2000). Identi-

fication of a mutation in the XNP/ATR-X gene in a family

reported as Smith-Fineman-Myers syndrome. Am. J. Med.

Genet. 91, 83–85.

71. Trivier, E., De Cesare, D., Jacquot, S., Pannetier, S., Zackai, E.,

Young, I., Mandel, J.L., Sassone-Corsi, P., and Hanauer, A.

(1996). Mutations in the kinase Rsk-2 associated with

Coffin-Lowry syndrome. Nature 384, 567–570.

72. Kalscheuer, V.M., Freude, K., Musante, L., Jensen, L.R.,

Yntema, H.G., Gecz, J., Sefiani, A., Hoffmann, K., Moser, B.,

Haas, S., et al. (2004). Mutations in the polyglutamine

binding protein 1 gene cause X-linked mental retardation.

Nat. Genet. 35, 313–315.

73. Lenski, C., Abidi, F., Meindl, A., Gibson, A., Platzer, M., Frank

Kooy, R., Lubs, H.A., Stevenson, R.E., Ramser, J., and

Schwartz, C.E. (2004). Novel truncating mutations in the

polyglutamine tract binding protein 1 gene (PQBP1) cause

Renpenning syndrome and X-linked mental retardation in

another family with microcephaly. Am. J. Hum. Genet. 74,

777–780.

74. Stevenson, R.E., Bennett, C.W., Abidi, F., Kleefstra, T.,

Porteous, M., Simensen, R.J., Lubs, H.A., Hamel, B.C., and

Schwartz, C.E. (2005). Renpenning syndrome comes into

focus. Am. J. Med. Genet. A. 134, 415–421.

75. Lubs, H., Abidi, F.E., Echeverri, R., Holloway, L., Meindl, A.,

Stevenson, R.E., and Schwartz, C.E. (2006). Golabi-Ito-Hall

syndrome results from a missense mutation in the WW

domain of the PQBP1 gene. J. Med. Genet. 43, e30.

76. Strømme, P., Mangelsdorf, M.E., Scheffer, I.E., and Gecz, J.

(2002). Infantile spasms, dystonia, and other X-linked

phenotypes caused by mutations in Aristaless related

homeobox gene, ARX. Brain Dev. 24, 266–268.

77. Strømme, P., Mangelsdorf, M.E., Shaw, M.A., Lower, K.M.,

Lewis, S.M., Bruyere, H., Lutcherath, V., Gedeon, A.K.,

Wallace, R.H., Scheffer, I.E., et al. (2002). Mutations in the

human ortholog of Aristaless cause X-linked mental retarda-

tion and epilepsy. Nat. Genet. 30, 441–445.

78. Bienvenu, T., Poirier, K., Friocourt, G., Bahi, N., Beaumont,

D., Fauchereau, F., Ben Jeema, L., Zemni, R., Vinet, M.C.,

Francis, F., et al. (2002). ARX, a novel Prd-class-homeobox

gene highly expressed in the telencephalon, is mutated in

X-linked mental retardation. Hum. Mol. Genet. 11, 981–991.

79. Frints, S.G., Froyen, G., Marynen, P., Willekens, D., Legius, E.,

and Fryns, J.P. (2002). Re-evaluation of MRX36 family after

discovery of an ARX gene mutation reveals mild neurological

features of Partington syndrome. Am. J. Med. Genet. 112,

427–428.

80. Kitamura, K., Yanazawa, M., Sugiyama, N., Miura, H., Iizuka-

Kogo, A., Kusaka, M., Omichi, K., Suzuki, R., Kato-Fukui, Y.,

Kamiirisa, K., et al. (2002). Mutation of ARX causes abnormal

development of forebrain and testes in mice and X-linked

lissencephaly with abnormal genitalia in humans. Nat. Genet.

32, 359–369.

81. Uyanik, G., Aigner, L., Martin, P., Gross, C., Neumann, D.,

Marschner-Schafer, H., Hehr, U., and Winkler, J. (2003).

ARX mutations in X-linked lissencephaly with abnormal

genitalia. Neurology 61, 232–235.

82. Kato, M., Das, S., Petras, K., Kitamura, K., Morohashi, K.,

Abuelo, D.N., Barr, M., Bonneau, D., Brady, A.F., Carpenter,

N.J., et al. (2004). Mutations of ARX are associated with

striking pleiotropy and consistent genotype-phenotype

correlation. Hum. Mutat. 23, 147–159.

83. Stepp, M.L., Cason, A.L., Finnis, M., Mangelsdorf, M., Holin-

ski-Feder, E., Macgregor, D., MacMillan, A., Holden, J.J., Gecz,

J., Stevenson, R.E., and Schwartz, C.E. (2005). XLMR in MRX

families 29, 32, 33 and 38 results from the dup24 mutation in

the ARX (Aristaless related homeobox) gene. BMC Med.

Genet. 6, 16.


84. Opitz, J.M., and Kaveggia, E.G. (1974). Studies of malforma-

tion syndromes of man 33: the FG syndrome. An X-linked

recessive syndrome of multiple congenital anomalies and

mental retardation. Z. Kinderheilkd. 117, 1–18.

85. Opitz, J.M., Richieri-da Costa, A., Aase, J.M., and Benke, P.J.

(1988). FG syndrome update 1988: note of 5 new patients

and bibliography. Am. J. Med. Genet. 30, 309–328.

86. Romano, C., Baraitser, M., and Thompson, E. (1994). A clin-

ical follow-up of British patients with FG syndrome. Clin.

Dysmorphol. 3, 104–114.

87. Ozonoff, S., Williams, B.J., Rauch, A.M., and Opitz, J.O.

(2000). Behavior phenotype of FG syndrome: cognition,

personality, and behavior in eleven affected boys. Am. J.

Med. Genet. 97, 112–118.

88. Battaglia, A., Chines, C., and Carey, J.C. (2006). The FG

syndrome: report of a large Italian series. Am. J. Med. Genet.

A. 140, 2075–2079.

89. Briault, S., Hill, R., Shrimpton, A., Zhu, D., Till, M., Ronce, N.,

Margaritte-Jeannin, P., Baraitser, M., Middleton-Price, H.,

Malcolm, S., et al. (1997). A gene for FG syndrome maps in

the Xq12-q21.31 region. Am. J. Med. Genet. 73, 87–90.

90. Briault, S., Villard, L., Rogner, U., Coy, J., Odent, S., Lucas, J.,

Passage, E., Zhu, D., Shrimpton, A., Pembrey, M., et al.

(2000). Mapping of X chromosome inversion breakpoints

[inv(X)(q11q28)] associated with FG syndrome: A second

FG locus [FGS2]? Am. J. Med. Genet. 95, 178–181.

91. Piluso, G., Carella, M., D’Avanzo, M., Santinelli, R., Carrano,

E.M., D’Avanzo, A., D’Adamo, A.P., Gasparini, P., and Nigro,

V. (2003). Genetic heterogeneity of FG syndrome: a fourth

locus (FGS4) maps to Xp11.4-p11.3 in an Italian family.

Hum. Genet. 112, 124–130.

92. Dessay, S., Moizard, M.P., Gilardi, J.L., Opitz, J.M., Middle-

ton-Price, H., Pembrey, M., Moraine, C., and Briault, S.

(2002). FG syndrome: linkage analysis in two families sup-

porting a new gene localization at Xp22.3 [FGS3]. Am. J.

Med. Genet. 112, 6–11.

93. Jehee, F.S., Rosenberg, C., Krepischi-Santos, A.C., Kok, F.,

Knijnenburg, J., Froyen, G., Vianna-Morgante, A.M., Opitz,

J.M., and Passos-Bueno, M.R. (2005). An Xq22.3 duplication

detected by comparative genomic hybridization microarray

(Array-CGH) defines a new locus (FGS5) for FG syndrome.

Am. J. Med. Genet. A. 139, 221–226.

94. Tarpey, P.S., Raymond, F.L., Nguyen, L.S., Rodriguez, J.,

Hackett, A., Vandeleur, L., Smith, R., Shoubridge, C., Edkins,

S., Stevens, C., et al. (2007). Mutations in UPF3B, a member

of the nonsense-mediated mRNA decay complex, cause syn-

dromic and nonsyndromic mental retardation. Nat. Genet.

39, 1127–1133.

95. Unger, S., Mainberger, A., Spitz, C., Bahr, A., Zeschnigk, C.,

Zabel, B., Superti-Furga, A., and Morris-Rosendahl, D.J.

(2007). Filamin A mutation is one cause of FG syndrome.

Am. J. Med. Genet. A. 143A, 1876–1879.

96. Risheg, H., Graham, J.M., Jr., Clark, R.D., Rogers, R.C., Opitz,

J.M., Moeschler, J.B., Peiffer, A.P., May, M., Joseph, S.M.,

Jones, J.R., et al. (2007). A recurrent mutation in MED12

leading to R961W causes Opitz-Kaveggia syndrome. Nat.

Genet. 39, 451–453.

97. Lyons, M.J., Graham, J.M., Jr., Neri, G., Hunter, A.G.W.,

Clark, R.D., Rogers, R.C., Moscarda, M., Boccuto, L., Simen-

sen, R., Dodd, J., et al. (2009). Clinical experience in the

evaluation of 30 patients with a prior diagnosis of FG

syndrome. J. Med. Genet. 46, 9–13.

98. Clark, R.D., Graham, J.M., Jr., Friez, M.J., Hoo, J.J., Jones, K.L.,

McKeown, C., Moeschler, J.B., Raymond, F.L., Rogers, R.C.,

Schwartz, C.E., et al. (2009). FG syndrome, an X-linked

multiple congenital anomaly syndrome: the clinical pheno-

type and an algorithm for diagnostic testing. Genet. Med.

11, 769–775.

99. Bergmann, C., Zerres, K., Senderek, J., Rudnik-Schoneborn,

S., Eggermann, T., Hausler, M., Mull, M., and Ramaekers,

V.T. (2003). Oligophrenin 1 (OPHN1) gene mutation causes

syndromic X-linked mental retardation with epilepsy, rostral

ventricular enlargement and cerebellar hypoplasia. Brain

126, 1537–1544.

100. Philip, N., Chabrol, B., Lossi, A.M., Cardoso, C., Guerrini, R.,

Dobyns, W.B., Raybaud, C., and Villard, L. (2003). Mutations

in the oligophrenin-1 gene (OPHN1) cause X linked congen-

ital cerebellar hypoplasia. J. Med. Genet. 40, 441–446.

101. Bittel, D.C., Kibiryeva, N., and Butler, M.G. (2007). Whole

genome microarray analysis of gene expression in subjects

with fragile X syndrome. Genet. Med. 9, 464–472.

102. French, C.A., Miyoshi, I., Kubonishi, I., Grier, H.E., Perez-

Atayde, A.R., and Fletcher, J.A. (2003). BRD4-NUT fusion

oncogene: A novel mechanism in aggressive carcinoma.

Cancer Res. 63, 304–307.

103. Stevenson, R.E., and Schwartz, C.E. (2009). X-linked intellec-

tual disability: Unique vulnerability of the male genome.

Dev. Disabil. Res. Rev. 15, 361–368.

104. Renieri, A., Pescucci, C., Longo, I., Ariani, F., Mari, F., and

Meloni, I. (2005). Non-syndromic X-linked mental retarda-

tion: From a molecular to a clinical point of view. J. Cell.

Physiol. 204, 8–20.

105. Zechner, U., Wilda, M., Kehrer-Sawatzki, H., Vogel, W.,

Fundele, R., and Hameister, H. (2001). A high density of

X-linked genes for general cognitive ability: A run-away

process shaping human evolution? Trends Genet. 17, 697–701.

106. Graves, J.A., Gecz, J., and Hameister, H. (2002). Evolution of

the human X—a smart and sexy chromosome that controls

speciation and development. Cytogenet. Genome Res. 99,

141–145.

107. Nguyen, D.K., and Disteche, C.M. (2006). Dosage compensa-

tion of the active X chromosome in mammals. Nat. Genet.

38, 47–53.

108. Turner, G., and Partington, M.W. (1991). Genes for intelli-

gence on the X chromosome. J. Med. Genet. 28, 429.

109. Turner, G. (1996). Finding genes on the X chromosome by

which homo may have become sapiens. Am. J. Hum. Genet.

58, 1109–1110.

110. Turner, G. (1996). Intelligence and the X chromosome.

Lancet 347, 1814–1815.

111. Hedges, L.V., and Nowell, A. (1995). Sex differences in

mental test scores, variability, and numbers of high-scoring

individuals. Science 269, 41–45.

112. Lubs, H.A. (1999). The other side of the coin: a hypothesis

concerning the importance of genes for high intelligence

and evolution of the X chromosome. Am. J. Med. Genet.

85, 206–208.

113. McMaster, G., and Trafzer, C. (2004). Native Universe, Voices

of Indian America (Washington, DC: Smithsonian and

National Geographic).

114. Turner, G., Boyle, J., Partington, M.W., Kerr, B., Raymond,

F.L., and Gecz, J. (2008). Restoring reproductive confidence

in families with X-linked mental retardation by finding the

causal mutation. Clin. Genet. 73, 188–190.


ARTICLE

On Sharing Quantitative Trait GWAS Results in an Era of Multiple-omics Data and the Limits of Genomic Privacy

Hae Kyung Im,1,* Eric R. Gamazon,2 Dan L. Nicolae,2,3,4 and Nancy J. Cox2,3,*

Recent advances in genome-scale, system-level measurements of quantitative phenotypes (transcriptome, metabolome, and proteome)

promise to yield unprecedented biological insights. In this environment, broad dissemination of results from genome-wide association

studies (GWASs) or deep-sequencing efforts is highly desirable. However, summary results from case-control studies (allele frequencies)

have been withdrawn from public access because it has been shown that they can be used for inferring participation in a study if the

individual’s genotype is available. A natural question that follows is how much private information is contained in summary results

from quantitative trait GWAS such as regression coefficients or p values. We show that regression coefficients for many SNPs can reveal

the person’s participation and, for participants, his or her phenotype with high accuracy. Our power calculations show that regression

coefficients contain as much information on individuals as allele frequencies do, if the person’s phenotype is rather extreme or if

multiple phenotypes are available as has been increasingly facilitated by the use of multiple-omics data sets. These findings emphasize

the need to devise a mechanism that allows data sharing that will facilitate scientific progress without sacrificing privacy protection.

Introduction

Homer et al.1 showed that it is possible to detect an individ-

ual’s presence in a complex genomic DNA mixture even

when the mixture contains only trace quantities of his or

her DNA. The study considered the implications of its find-

ings, motivated originally as an application to forensic

science, in the context of genome-wide association studies

(GWASs) from which aggregate allele frequencies for a large

number of markers were being made publicly available.

Shortly after this publication, a reduction in open access

to aggregate GWAS results was implemented. Jacobs et al.2

presented an improved method using a likelihood

approach and showed that disease status could be inferred

for participants of the study. Visscher et al.3 and Sankarara-

man et al.4 calculated power estimates to understand the

limits of individual detection from sample allele frequen-

cies. They showed that the power to detect membership is

determined by the ratio between the number of markers

and the number of participants in the study.

We present a method that can infer an individual’s participation

in a study when regression coefficients from

quantitative phenotypes are available. This problem is

especially relevant now that genome-wide system-level

measurements of quantitative phenotypes (transcriptome,

proteome, and metabolome) are being widely collected

and analyzed. Undoubtedly, disseminating results from

quantitative GWAS and deep-sequencing efforts could be

of enormous benefit to research groups working on related

traits. We explore several statistics that can discriminate

study participants from nonparticipants. Notably, we find

that the use of only the direction of effects (signs of the

coefficients) enables membership inference with good

accuracy. We show the results from applying the statistics

to the Genetics of Kidneys in Diabetes (GoKinD) data

set5,6 to illustrate the level of information contained in

aggregate data. We also provide quantification of the infor-

mation content by computing the power of the method.

Furthermore, we discuss a general framework that can be

used for integrating our findings and earlier studies of

genomic privacy based on sample allele frequencies. With

the increasing use of high-throughput technologies to integrate

multiple-omics data sets, these various statistics result

in a more powerful approach to the identification problem

than with the use of a single phenotype.

Material and Methods

Let us assume that we have the estimated regression coefficients

for M independent SNPs, that we use data on n individuals in a

GWAS (test sample), and that we also have the allelic dosage for

$n^*$ individuals from a reference population such as HapMap7,8 or the

1000 Genomes Project.9

Membership Inference Method

We define a statistic (a function of available data) that has a

different distribution depending on the membership status and

use this difference to infer membership. We compute this statistic

for the individual of interest, I, and for all individuals in the refer-

ence population. If the statistic falls well within the reference

distribution we will conclude that the individual is not likely to

have participated in the study, and if the statistic falls in the

extremes of the distribution, we will conclude that the individual

did participate in the study.

1Department of Health Studies, University of Chicago, Chicago, IL, 60637, USA; 2Department of Medicine, University of Chicago, Chicago, IL, 60637, USA; 3Department of Human Genetics, University of Chicago, Chicago, IL, 60637, USA; 4Department of Statistics, University of Chicago, Chicago,

IL, 60637, USA

*Correspondence: [email protected] (H.K.I.), [email protected] (N.J.C.)

DOI 10.1016/j.ajhg.2012.02.008. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 591–598, April 6, 2012 591

Let $\hat{Y}_I$ be defined as

$$\hat{Y}_I = \frac{n}{M} \sum_{j=1}^{M} \hat{\beta}_j \left( X_{I,j} - \hat{X}_j \right), \qquad \text{(Equation 1)}$$

where $X_{I,j}$ is the allelic dosage of individual I at SNP j, $\hat{\beta}_j$ is the estimated coefficient from fitting the model $Y_i = \alpha_j + \beta_j X_{i,j} + \epsilon_i$, and $\hat{X}_j$ is the estimated mean of allelic dosage (twice the allele frequency) for SNP j computed with the reference group.
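As an illustration of Equation 1, the following is a minimal sketch rather than the authors' implementation. The function name `y_hat_statistic` and its argument names are assumptions made for this example; the inputs are the published per-SNP coefficients, one person's allelic dosages, and the reference-group mean dosages.

```python
from typing import Sequence


def y_hat_statistic(
    beta_hat: Sequence[float],
    dosage: Sequence[float],
    ref_mean_dosage: Sequence[float],
    n: int,
) -> float:
    """Equation 1: (n / M) * sum_j beta_hat[j] * (dosage[j] - ref_mean_dosage[j]).

    beta_hat        -- estimated regression coefficient for each of the M SNPs
    dosage          -- allelic dosage (0..2) of the individual of interest at each SNP
    ref_mean_dosage -- mean allelic dosage at each SNP in the reference group
    n               -- number of individuals in the GWAS (test sample)
    """
    m = len(beta_hat)
    if m == 0 or len(dosage) != m or len(ref_mean_dosage) != m:
        raise ValueError("all inputs must share the same non-zero length M")
    total = sum(
        b * (x - x_bar) for b, x, x_bar in zip(beta_hat, dosage, ref_mean_dosage)
    )
    return n / m * total
```

In use, the same function would be evaluated for the individual of interest and for every member of the reference group, and the individual's value would then be compared with the reference distribution.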

Conditional Mean and Variance of $\hat{Y}$

The expected value and the variance of the statistic $\hat{Y}_I$ conditional on the individual’s genotype $X_I$, demeaned phenotype $Y_I - \mu$, and membership status (in or out) are as follows:

$$\begin{aligned}
E[\hat{Y} \mid X_I, Y_I, \text{in}] &\approx Y_I - \mu \\
E[\hat{Y} \mid X_I, Y_I, \text{out}] &\approx 0 \\
\mathrm{Var}[\hat{Y} \mid X_I, Y_I, \text{in}] &\approx \sigma^2 \frac{n}{M} \\
\mathrm{Var}[\hat{Y} \mid X_I, Y_I, \text{out}] &\approx \sigma^2 \frac{n}{M},
\end{aligned} \qquad \text{(Equation 2)}$$

where $\sigma^2$ is the variance of the phenotype, and $\mu$ is the population mean of the phenotype Y. Note that for the method to work we do not need to make use of these expressions, nor do we need to know $\sigma^2$ and $\mu$, because we rely on the empirical distribution from the reference population to determine membership. These expressions will serve to estimate the power of the method.

Unconditional on $Y_I$, the variance of the statistic $\hat{Y}$ is given by

$$\mathrm{Var}[\hat{Y} \mid X_I, \text{in}] \approx \sigma^2.$$

In computing these quantities we assume that the number of markers is much larger than the number of individuals in the test sample and the number of individuals in the reference group: $M \gg n \gg 1$ and $M \gg n^* \gg 1$. Hardy Weinberg equilibrium is assumed. To derive these expressions, we used standard Taylor expansions and the law of iterated expectations. We tested the validity of these approximations for finite samples (n between 100 and 1,000 and $M/n$ between 1,000 and 50,000) by fitting linear regressions with simulated genotypes and phenotypes and computing the sample mean and variances of the $\hat{Y}$ statistic. See Supplemental Data, available online, to find plots of the validation.

Power of the Method

To compute power, we define the null and alternative hypotheses. Under the null hypothesis the individual did not participate in the study (nor did any relatives of the individual), whereas under the alternative hypothesis, the individual did participate. Using the mean and variance under the null hypothesis and the corresponding mean and variance under the alternative hypothesis computed in Equation 2 and assuming $M \gg n \gg 1$, $M \gg n^* \gg 1$, normality of the statistic $\hat{Y}$, and the sign of $Y_I - \mu$ to be known, the power will be approximately given by

$$\text{power} \approx \Phi\left( \frac{|Y_I - \mu|}{\sigma} \sqrt{\frac{M}{n}} - z_\alpha \right), \qquad \text{(Equation 3)}$$

where $\alpha$ is the type I error, $z_x = \Phi^{-1}(1 - x)$ is the $(1 - x)$-quantile of the standard normal distribution, and $\Phi$ is the normal cumulative distribution function. If the sign of $Y_I - \mu$ is not known, a two-sided test will be used in the derivation and the power will be given by

$$\text{power} \approx \Phi\left( \frac{|Y_I - \mu|}{\sigma} \sqrt{\frac{M}{n}} - z_{\alpha/2} \right). \qquad \text{(Equation 4)}$$

See derivation in Appendix A. Because $\Phi$ is a strictly increasing function, the power

• increases when M, the number of SNPs, increases

• decreases when n, the study’s sample size, increases

• increases when the individual’s phenotype deviates more from the mean (scaled by the standard deviation)

• increases when $\alpha$, the type I error, increases
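The one-sided and two-sided power approximations (Equations 3 and 4) can be evaluated numerically. The sketch below is illustrative only: the function names and the `sign_known` flag are assumptions introduced here, and the standard normal distribution comes from Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_upper(x: float) -> float:
    """z_x = Phi^{-1}(1 - x), the (1 - x)-quantile of the standard normal."""
    return _STD_NORMAL.inv_cdf(1.0 - x)


def membership_power(
    deviation_in_sd: float,
    m_snps: int,
    n_study: int,
    alpha: float,
    sign_known: bool = True,
) -> float:
    """Approximate power of the membership test.

    deviation_in_sd -- (Y_I - mu) / sigma, the phenotype deviation in SD units
    sign_known      -- True gives the one-sided form (Equation 3),
                       False gives the two-sided form (Equation 4)
    """
    z = z_upper(alpha if sign_known else alpha / 2.0)
    return _STD_NORMAL.cdf(abs(deviation_in_sd) * sqrt(m_snps / n_study) - z)
```

With a deviation of zero the expression reduces to the type I error, and increasing M or decreasing n raises the value, in line with the list above.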

To facilitate comparison with Visscher et al.3 and Sankararaman et al.,4 let us express the one-sided power Equation 3 with the following (equivalent) implicit formula

$$(z_\alpha + z_\beta)^2 \approx \left( \frac{Y_I - \mu}{\sigma} \right)^2 \frac{M}{n}, \qquad \text{(Equation 5)}$$

where $1 - \beta$ is the power (note that in Sankararaman et al.4 $\beta$ is defined as the power). Recall that in Visscher et al.3 and Sankararaman et al.4 power was given implicitly by

$$(z_\alpha + z_\beta)^2 \approx \frac{M}{n}. \qquad \text{(Equation 6)}$$

Thus, the only difference between Equations 5 and 6 is the factor $((Y_I - \mu)/\sigma)^2$. If the phenotype of the person deviates more than one standard deviation away from the mean, i.e., $|Y_I - \mu| > \sigma$, and the sign of $Y_I - \mu$ is known, the power when regression coefficients are used is larger than it is when allele frequencies are used. If the person’s phenotype is close to the mean, then the power will be much diminished. Although expectations are computed conditional on $Y_I - \mu$, we do not need to know its magnitude in order to achieve this power. However, we do need to know the sign of $Y_I - \mu$ in order to keep the test one-sided. If the sign is not used, $|Y_I - \mu|$ would need to exceed $1 + (z_{\alpha/2} - z_\alpha)/\sqrt{M/n}$ times the standard deviation in order to achieve greater power than the allele frequency case. As an example, if $\alpha = 0.05$ and $M/n = 100$, $|Y_I - \mu|$ would need to be greater than 1.031 times $\sigma$.
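The break-even factor in the worked example can be checked numerically. This is a small sketch under the same normal approximation; the function name `two_sided_break_even_factor` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_break_even_factor(alpha: float, m_over_n: float) -> float:
    """Smallest |Y_I - mu| / sigma at which the two-sided regression-coefficient
    test matches the allele-frequency power: 1 + (z_{alpha/2} - z_alpha) / sqrt(M/n).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)
    z_half_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return 1.0 + (z_half_alpha - z_alpha) / sqrt(m_over_n)


if __name__ == "__main__":
    # Example from the text: alpha = 0.05 and M/n = 100.
    print(round(two_sided_break_even_factor(0.05, 100.0), 4))
```

Because the correction term shrinks as $\sqrt{M/n}$ grows, the factor approaches 1 for studies reporting many SNPs per participant.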

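Individual Contribution to the Regression Coefficient

In order to get an intuitive understanding of the contribution of each individual from the sample, we can decompose the estimated regression coefficient into roughly the sum of individual contributions:

$$\begin{aligned}
\hat{\beta}_j &= \left( \tilde{X}_j' \tilde{X}_j \right)^{-1} \tilde{X}_j' \tilde{Y} \\
\hat{\beta}_j &\approx \frac{1}{n \sigma_j^2} \tilde{X}_{I,j} \tilde{Y}_I + \frac{1}{n \sigma_j^2} \sum_{i \neq I} \tilde{X}_{i,j} \tilde{Y}_i \\
\hat{\beta}_j &\approx \tilde{\beta}_{I,j} + \sum_{i \neq I} \tilde{\beta}_{i,j},
\end{aligned} \qquad \text{(Equation 7)}$$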

defining $\tilde{\beta}_{i,j} = (1/(n \sigma_j^2)) \tilde{X}_{i,j} \tilde{Y}_i$ as the individual contribution to the regression coefficient and $\sigma_j^2$ as the variance of the allelic dosage (under Hardy Weinberg assumption $\sigma_j^2 = 2 p_j (1 - p_j)$, where $p_j$ is the minor allele frequency of SNP j). We use the tilde, as in $\tilde{X}$, for the demeaned variable that uses the mean from the sample. It is worth comparing with the decomposition for the case when minor allele frequencies for the sample are available: $\hat{p}_j \approx (p_{I,j}/n) + \sum_{i \neq I} (p_{i,j}/n)$, where $\hat{p}_j$ is the sample minor allele frequency and $p_{i,j}$ is the allelic dosage divided by 2 of individual i for SNP j. This similarity gives an intuitive understanding of the corresponding similarity in the dependence of power on the ratio of the number of SNPs and sample size of the study.
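The decomposition in Equation 7 can be illustrated for a single SNP. In this sketch the sample variance of the dosage stands in for $\sigma_j^2$, which is an assumption made here so that the individual contributions sum exactly to the ordinary least squares slope; with the Hardy Weinberg value $2p_j(1 - p_j)$ the sum would only be approximate. The function names are illustrative.

```python
from statistics import fmean
from typing import List, Sequence


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of the simple regression y = a + b * x fitted by least squares."""
    x_bar, y_bar = fmean(x), fmean(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den


def individual_contributions(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Per-individual terms X~_i * Y~_i / (n * s2) for one SNP.

    s2 is the sample variance of the dosage (assumption: it replaces sigma_j^2).
    """
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    x_tilde = [xi - x_bar for xi in x]  # demeaned dosage
    y_tilde = [yi - y_bar for yi in y]  # demeaned phenotype
    s2 = sum(v * v for v in x_tilde) / n
    return [xt * yt / (n * s2) for xt, yt in zip(x_tilde, y_tilde)]
```

Each participant therefore leaves a term of order 1/n in every published coefficient, which is the same order as a participant's share of a sample allele frequency.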


Combining Multiple Phenotypes

If results for multiple phenotypes, such as eQTL (or other omics data) results, are available, we can combine the information regarding the individual’s membership by using a Fisher type of method (the sum of logarithms of p values).10

For each phenotype k, we can compute an empirical p value, $p_k$, defined as the proportion of reference individuals with magnitude $|\hat{Y}|$ greater than the individual’s $|\hat{Y}_I|$. We can combine p values across different phenotypes by computing

$$-2 \sum_{k=1}^{n_{\text{pheno}}} \log_{10} p_k,$$

where $n_{\text{pheno}}$ is the number of phenotypes to be combined. In addition to accumulating evidence across phenotypes, this method avoids the problem of lack of power due to one particular phenotype being close to the population mean.
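The two steps described above, an empirical p value per phenotype followed by a Fisher-type combination, might be sketched as follows. The function names are illustrative, and the add-one correction in `empirical_p_value` is an assumption introduced here (it is not stated in the text) so that the logarithm stays finite when no reference individual exceeds the observed magnitude. The base-10 logarithm follows the formula as given.

```python
from math import log10
from typing import Sequence


def empirical_p_value(stat: float, reference_stats: Sequence[float]) -> float:
    """Share of reference individuals whose |statistic| exceeds |stat|.

    Assumption: an add-one correction, (count + 1) / (N + 1), keeps the
    result strictly positive.
    """
    if not reference_stats:
        raise ValueError("reference_stats must not be empty")
    exceed = sum(1 for r in reference_stats if abs(r) > abs(stat))
    return (exceed + 1) / (len(reference_stats) + 1)


def combined_score(p_values: Sequence[float]) -> float:
    """Fisher-type combination: -2 * sum_k log10(p_k)."""
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    return -2.0 * sum(log10(p) for p in p_values)
```

A larger combined score indicates stronger evidence of membership; as with the single-phenotype statistic, it would be judged against the scores obtained for the reference individuals.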

Covariate Adjustment

Usually other covariates such as age, sex, etc. are adjusted for when performing GWASs. If the allelic dosage is independent of the covariates (as will likely be the case for most SNPs), $\hat{Y}$ will converge to the covariate-adjusted phenotype instead of the actual phenotype. The standard deviation might change if the covariates explain a substantial portion of the phenotypic variability. However, the method will still work because under no participation $\hat{Y}$ will still be around 0, whereas if the individual participated in the study, $\hat{Y}$ will converge to the covariate-adjusted phenotype. The method does not require knowing the actual phenotype and it will work relative to this adjusted phenotype. For the purpose of re-identification using our method, the presence of covariates is only a nuisance and no additional power is achieved when they are present.

Sample Correlation Statistic

Equation 7 suggests that the sample correlation between the estimated beta and the individual’s genotype might be useful because we would expect the correlation to be 0 if the individual was not in the sample and different from 0 if the individual was part of the study.

$$\hat{C} = \frac{\sum_{j=1}^{M} \left(\hat{\beta}_j - \overline{\hat{\beta}}\right)\left(X_{I,j} - \hat{X}_j - \overline{X_I - \hat{X}}\right)}{\sqrt{\sum_{j} \left(\hat{\beta}_j - \overline{\hat{\beta}}\right)^2 \sum_{j} \left(X_{I,j} - \hat{X}_j - \overline{X_I - \hat{X}}\right)^2}},$$

where the long bar above an expression means the sample mean of the expression.
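As a rough illustration of the correlation statistic, the sketch below computes it with NumPy. The function name and argument names are hypothetical; `x_mean` stands for the per-SNP mean dosage $\hat{X}_j$, which is assumed here to be supplied by the caller.

```python
import numpy as np


def correlation_statistic(beta_hat, x_individual, x_mean):
    """Sample correlation between beta_hat and the demeaned dosage across SNPs."""
    b = np.asarray(beta_hat, dtype=float)
    d = np.asarray(x_individual, dtype=float) - np.asarray(x_mean, dtype=float)
    b = b - b.mean()  # subtract the mean of beta_hat over SNPs (the long bar)
    d = d - d.mean()  # subtract the mean of (X_I - X_hat) over SNPs
    return float((b * d).sum() / np.sqrt((b ** 2).sum() * (d ** 2).sum()))
```

Values near 0 are expected for nonparticipants, and values away from 0 for participants.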

Sign Statistic

Equation 7 also shows that the sign of the correlation coefficient will be slightly more likely to match the sign of the demeaned allelic dosage if the person participated in the study than otherwise. Let $\hat{S}$ be defined as:

$$\hat{S} = \sum_{j=1}^{M} \operatorname{sign}\left(\hat{\beta}_j\right) \operatorname{sign}\left(X_{I,j} - \hat{X}_j\right)$$

We expect the product $\operatorname{sign}(\hat{\beta}_j)\operatorname{sign}(X_{I,j} - \hat{X}_j)$ to be positive (or negative) strictly more than 50% of the time if the individual participated in the study and his or her phenotype is above (or below) average. By looking at the absolute value of the sign statistic we expect to gain information on whether the individual was part of the study or not.
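The sign statistic is simple enough to write out directly. The following is a minimal sketch under the same assumptions as the definition above; the function and argument names are hypothetical, and `x_mean` again stands for $\hat{X}_j$.

```python
import numpy as np


def sign_statistic(beta_hat, x_individual, x_mean):
    """Sum over SNPs of sign(beta_hat_j) * sign(X_Ij - X_hat_j)."""
    b = np.asarray(beta_hat, dtype=float)
    d = np.asarray(x_individual, dtype=float) - np.asarray(x_mean, dtype=float)
    # np.sign returns -1, 0, or +1; SNPs with a zero factor contribute nothing.
    return int(np.sum(np.sign(b) * np.sign(d)))
```

Only the signs of the coefficients enter the computation, so rounding the published coefficients leaves this statistic unchanged as long as the signs are preserved.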

Analysis Details

We used the PLINK software11 and filtered out SNP markers that were not in Hardy-Weinberg equilibrium (p < 0.001) and those that had minor allele frequencies less than 5%. Receiver operating characteristic (ROC) curves were generated by using the absolute value of the statistic as the predicting variable and membership in the sample as the labels by using the ROCR12 package for the R statistical package.13 We used only individuals who self-reported as white both for sample and reference.
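The area under the ROC curve used throughout the Results can also be computed without a plotting package. The sketch below is a stand-in for the ROCR analysis, not a reproduction of it: it uses the rank (Mann-Whitney) form of the AUC, and the function name is hypothetical.

```python
import numpy as np


def auc_from_statistics(member_stats, nonmember_stats):
    """AUC when |statistic| is used to predict membership.

    Equals the probability that a randomly chosen member has a larger
    |statistic| than a randomly chosen nonmember, counting ties as 1/2.
    """
    m = np.abs(np.asarray(member_stats, dtype=float))[:, None]
    r = np.abs(np.asarray(nonmember_stats, dtype=float))[None, :]
    return float(((m > r) + 0.5 * (m == r)).mean())
```

An AUC of 0.5 corresponds to no discrimination and an AUC of 1 to perfect separation of members from nonmembers.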

Results

We show the performance of the statistics defined in Material and Methods ($\hat{Y}$, $\hat{S}$, $\hat{C}$) by using data from the GoKinD (Genetics of Kidneys in Diabetes) study.5,6 The data set was downloaded from dbGaP14 and consisted of more than 1,800 probands with long-standing type 1 diabetes, over 300 dichotomous and quantitative phenotypes, and genotypes from the Affymetrix Genome-Wide Human SNP Array 5.0 platform. We used a subset of 1,644 individuals reported to be Caucasian.

We show results for two of the phenotypes: cholesterol

level and body mass index (BMI). We also tested the

method on a third simulated phenotype and found at least

as good performance. The latter demonstrates that the

method does not depend on any real effect of genotype

on phenotype.

We randomly sampled 100, 500, and 1,000 individuals

from each study’s cohort and performed a GWAS including

only individuals from each random sample. The remaining

individuals were used as reference group. The statistics ($\hat{Y}$, $\hat{S}$, $\hat{C}$) were computed for both sample and reference individuals.

Identifiability Statistic and Phenotype Reconstruction

Figure 1 shows $\hat{Y}$ versus the actual phenotype (rank-normalized cholesterol levels). The blue dots correspond to individuals in the sample and the black dots correspond to individuals in the reference group. For individuals in the sample, $\hat{Y}$ lies close to the one-to-one line (perfect prediction line), whereas the individuals in the reference population lie close to a flat line around 0 (consistent with our calculations of mean and variances). The sample size was $n = 1{,}000$ and the number of SNPs was $M = 300{,}000$. The number of reference individuals was 644.

This demonstrates that for individuals who participated in a study, their phenotype can be reconstructed with high accuracy using the $\hat{Y}$ statistic, whereas for nonparticipants what we get is mostly noise.
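The behavior shown in Figure 1 can be reproduced qualitatively with a small simulation. The sketch below rests on several assumptions that are not taken from this section: SNPs are simulated as independent, the phenotype is pure noise with no genetic effect, and $\hat{Y}_I$ is taken to be $(n/M)\sum_j \hat{\beta}_j (X_{I,j} - \hat{X}_j)$, a scaling chosen because it is consistent with the mean $Y_I - \mu$ and variance $\sigma^2 n/M$ quoted in Appendix A. The function name and default parameters are hypothetical.

```python
import numpy as np


def simulate_yhat(n=100, n_ref=100, M=20000, seed=0):
    """Simulate a GWAS on n members and compute Y-hat for members and references."""
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.1, 0.5, size=M)  # assumed allele frequencies
    geno = rng.binomial(2, maf, size=(n + n_ref, M)).astype(float)
    x_sample, x_ref = geno[:n], geno[n:]
    y = rng.standard_normal(n)  # phenotype unrelated to genotype
    y_centered = y - y.mean()

    x_mean = x_sample.mean(axis=0)  # per-SNP mean dosage in the sample
    xc = x_sample - x_mean
    denom = (xc ** 2).sum(axis=0)
    keep = denom > 0  # drop SNPs that are monomorphic in the sample
    # Per-SNP simple regression slope of phenotype on dosage.
    beta_hat = (xc[:, keep].T @ y_centered) / denom[keep]

    scale = n / keep.sum()  # assumed scaling of Y-hat (see lead-in)
    yhat_sample = scale * (xc[:, keep] @ beta_hat)
    yhat_ref = scale * ((x_ref[:, keep] - x_mean[keep]) @ beta_hat)
    return y_centered, yhat_sample, yhat_ref
```

Under these assumptions, plotting `yhat_sample` against `y_centered` should give points near the one-to-one line, while `yhat_ref` should scatter around 0, even though the simulated genotypes have no effect on the phenotype.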

Distribution of Statistic by Membership Status

and ROC Analysis

The left panel in Figure 2 shows the distribution of the absolute value of $\hat{Y}$ by membership status. As in Figure 1, nonmembers’ values lie close to 0, whereas members’ values are distributed over a large range. This difference in distributions is what allows us to discriminate between members and nonmembers.

The right panel shows the ROC curve, the true positive rate (sensitivity or power) versus the false positive rate (1-specificity or type I error), when we use $|\hat{Y}|$ to predict membership. A good test should yield a high true positive rate while keeping the false positive rate low; ideally the area under the curve (AUC) should be close to 1. For 300,000 SNPs and a sample size of 1,000, the AUC was 0.83, which is much greater than 0.5, showing clear discrimination power. The poorer performance relative to the allele frequency case is due to the fact that we do not assume the sign of the deviation from the mean to be known and that the phenotype values of some of the individuals in the test sample are close to the mean. Recall from Equation 3 that power (which is not equal to AUC but is a related measure of performance) is an increasing function of the absolute value of the difference between the phenotype and the mean. For average individuals (phenotype close to the mean) this method does not provide discrimination power.

Predictive Performance as Function of M/n

Figure 3 shows the area under the curve for different values

of sample size (n) and number of SNPs (M). Consistent with

our power calculation, we observe increasing performance

as the ratio of number of SNPs to sample size increases.

SNPs were chosen randomly from the full set of available

SNPs. The lower AUC for larger sample sizes is probably

because the independence of markers assumption fails

more dramatically as the total number of markers

increases.

Performance of Other Statistics and Their Information

Content

Figure 4 shows the distribution and performance of the

sign statistic. The left panel shows the distribution of the

sign statistic by membership status. The right panel shows

the ROC curve when we use the absolute value of the sign

statistic to predict membership. Notice that the area under

the curve is 0.75, which still shows good discrimination

power. This result suggests that a large portion of the

information regarding the individual’s participation is

contained in the signs.

The performance of the correlation statistic is almost identical to the performance of $\hat{Y}$, as one might have expected.

Covariate Adjustments

Figure 5 shows the ROC curve for $\hat{Y}$ with rank-normalized cholesterol levels as phenotype and sex and age as covariates in addition to allelic dosage. Note that the performance has not changed by adding the additional covariates. This was expected because our method is based on “over fitting” of the data.

In general, access to the covariates or phenotypes for the participants is not available and so we did not attempt to improve our method by using them. If the allelic dosage is independent of the covariates (as will likely be the case for most SNPs), $\hat{Y}$ will converge to the covariate-adjusted phenotype, and our method will work relative to this adjusted phenotype. We do not expect the inclusion of covariates to affect the performance of the method. Also note that our method relies on “over fitting” of the data that occurs for individuals in the sample and not on any real relationship between genotype and phenotype. As previously mentioned, we found that the method worked equally well when a simulated phenotype was used.

Figure 1. $\hat{Y}$ versus $Y$
$\hat{Y}$ versus the actual phenotype (cholesterol levels with normalizing transformation applied). The blue dots correspond to individuals in the sample and the black dots correspond to individuals in the reference group. For individuals in the sample $\hat{Y}$ lies close to the one-to-one line, whereas the individuals in the reference population lie close to a flat line around 0. The sample size was 1,000 and number of SNPs was 300,000.
[Plot: x axis, phenotype; y axis, yhat; annotations: n = 1000, M/n = 300.]

Figure 2. $\hat{Y}$ Distribution by Membership Status and Performance
(Left panel) The distribution of the absolute value of $\hat{Y}$ by membership status. As in Figure 1, nonmembers’ values lie close to 0, whereas the values for participants are distributed similar to the actual phenotype.
(Right panel) The ROC curve, the true positive rate (sensitivity) versus the false positive rate (1-specificity) when we use $|\hat{Y}|$ to predict membership. A good test should yield a high true positive rate (sensitivity) while keeping the false positive rate low (1-specificity); ideally the AUC should be close to 1. For 300,000 SNPs and a sample size of 1,000, the AUC was 0.83, which is reasonably close to 1.
[Plot: left panel groups, Reference and Sample; right panel axes, false positive rate versus true positive rate; annotations: n = 1000, M/n = 300.]

Multiple Phenotypes

To illustrate the effect of combining more than one phenotype, we applied the Fisher type method (the sum of the log of empirical p values, see details in Methods) to cholesterol and body mass index (BMI) regression coefficients.

Figure 6 shows the ROC curves when single phenotypes

were used compared to the curve when both were com-

bined. Clearly, the combined method outperforms both

single-phenotype methods. The AUC for each phenotype

was 83% and 87%, whereas the combined AUC is 95%.

The performance should improve as the number of pheno-

types increases.

Discussion

Given the increasing number of large-scale data sets in

which very large numbers of phenotypes will be subject

to GWAS or sequencing studies, it is of great interest to

quantify the level of participant’s private data contained

in aggregate results. The insights gained from our study

should be helpful in devising methods to facilitate broad

dissemination of study results without compromising the

participant’s privacy.

We present three statistics that can discriminate between

individuals who participated in a study and those who did

not. We show the performance of the method by using real data from the GoKinD GWAS. We also provide an approximate estimate of the power of the method when $\hat{Y}$ (the average of the regression coefficients times the allelic

dosage) is used. Power is determined by the ratio between

the number of markers and the sample size of the study,

much like when allele frequencies are available. But the

power is also modulated by the deviation from the mean

of the individual’s phenotype. This indicates that for indi-

viduals with extreme phenotypes (e.g., as expected from

certain study designs), more power can be achieved

(asymptotically) through the use of the regression coeffi-

cients than through the use of allele frequencies. But for

a person with an average phenotype the method provides

no power, which is expected because the average person

contributes very little to the estimate of the regression

coefficients. In an earlier study, Lumley and Rice15 considered the possibility that aggregate results from GWAS can reveal a participant’s phenotype with high accuracy, even for quantitative phenotypes. However, the problem of phenotype reconstruction (the subject of Lumley and Rice’s commentary on quantitative traits15) for a participant of a study and the problem of identifiability are distinct problems; furthermore, the problem of identifiability was not theoretically explored. Here we quantified the power of our identification method for quantitative traits, demonstrated the existence of various statistics that can detect the presence of individual genotypes from summary data, and sought to provide a general framework for comparing the power with earlier studies3,4 of genomic privacy based on sample allele frequencies.

Figure 3. Performance by Sample Size and Number of Markers
The plot shows the area under the curve for different values of sample size (n) and number of SNPs (M). Consistent with the power calculation, we observe increasing AUC as the ratio of number of SNPs to sample size increases. The lower AUC for sample sizes of 1,000 is probably due to a more pronounced effect of linkage disequilibrium as we use more markers.
[Plot: x axis, M/n; y axis, AUC; one curve per sample size.]

Figure 4. Sign Statistic Distribution and Performance
The left panel shows the distribution of the sign statistic by membership status. The right panel shows the ROC curve when we use the absolute value of the sign statistic to predict membership. The area under the curve is 0.75, a bit lower than the AUC when the actual estimated coefficients are used, but it still shows good discrimination. This suggests that a large portion of the information regarding individual’s membership is contained in the signs rather than in the absolute value of the regression coefficient.
[Plot: left panel groups, Reference and Sample; right panel axes, false positive rate versus true positive rate; annotations: n = 1000, M/n = 300.]

The approximate decomposition of an individual contri-

bution to the regression coefficients gives us an intuitive

understanding of the level of information contained in

these aggregate data. This decomposition shows the struc-

tural similarity with the case in which allele frequencies are

used to infer membership.

Even though we do not claim that our method provides

optimal discrimination, the striking similarity between our

expression for power and the one obtained by Visscher

et al.3 and Sankararaman et al.4 leads us to believe that it

might not be far from optimal. In addition, the similarity

between an individual contribution to the regression coef-

ficients and the contribution to the sample allele

frequency adds credence to our hypothesis.

Tests on several other GWAS data sets yielded similar

results. As expected, we also found that the performance

depends on the homogeneity of the study participants.

Population structure would need to be taken into account

if the GWAS results included a heterogeneous cohort.

Although not presented here, we have seen that $\hat{Y}$ has a larger magnitude for relatives of study participants

than for the reference population. Thus, the method pre-

sented here should be applicable to determine whether

relatives of the individual participated in the study, albeit

with reduced power.

We have derived and applied our method to an additive

model but extension to other models (recessive, dominant,

etc.) should be straightforward.

It is interesting to note that by using only the signs of the

regression coefficients, we still maintain a large portion of

the discrimination power of the method. We have seen

similar effects in other data sets. One practical implication

of this finding is that reducing the number of decimals

in the published regression coefficients would not be an

effective method to protect privacy.

If p values and signs were available, then regression

coefficients could be computed and our method would

identify participants. If only the p values are available,

the absolute values of the regression coefficients can be

calculated. The sign statistic suggests that we might be

able to guess the sign of the regression coefficient slightly

more often than 50% of the times. This would in principle

allow us to compute $\hat{Y}$. However, the power is likely to be

substantially reduced.

It is worth noting that the ability to predict the phenotype using $\hat{Y}$ and to infer membership is not related to

any real effect of genotype on phenotype. We have seen

that the method works as well or better with simulated

phenotypes. We note that genotypic information is being

used to infer study membership and to reconstruct trait

value used in the estimates of regression coefficients; no

prediction of phenotypic status in new individuals is being

done.

Sensitivity and specificity give us information on the probability of false negatives or false positives given the individual’s true participation status. In many cases, it might be more relevant to look at false positive or negative rates provided the individual was positive or negative according to our testing method. These are represented by

positive or negative predictive values. The positive predic-

tive value can become very small if the prior probability

of the individual participating in the study is very low.

For example, if all we know about the individual is the person’s gender, this probability could be as low as $10^{-5}$ or $10^{-6}$ (e.g., 1,000 participants out of 159 million male individuals from the USA). In this context, given that the individual was positive in the test, the probability of a false positive might still be very high. Naturally, because investigators have no control over how much prior information someone can come up with, this argument cannot be used to ignore the possible breach of confidentiality.

Figure 5. Performance with Covariate Adjustment
This figure shows the ROC curve for $\hat{Y}$ with rank-normalized cholesterol levels as phenotype and sex, age, and allelic dosage as covariates. Note that the performance is not changed by adding the additional covariates.
[Plot: left panel groups, Reference and Sample; right panel axes, false positive rate versus true positive rate; annotations: n = 1000, M/n = 300.]

Figure 6. Performance with Multiple Phenotypes
To illustrate the effect of combining more than one phenotype, we applied the Fisher type method (the sum of the log p values) to cholesterol and BMI regression coefficients. This figure shows the ROC curves when each one of the phenotypes was used compared to the curve when both were combined. Clearly, the combined method outperforms both single-phenotype methods. The AUC for each phenotype was 83% and 87%, whereas the combined AUC is 95%.
[Plot: axes, false positive rate versus true positive rate; curves: Cholesterol, BMI, Both.]
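The dependence of the positive predictive value on the prior probability follows from Bayes' rule, and a short calculation makes the point concrete. In the sketch below the function name is hypothetical, and the sensitivity and false positive rate in the usage example are illustrative values, not results from the study; only the prior of 1,000 out of 159 million comes from the text.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prior):
    """P(participant | positive test) by Bayes' rule."""
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


# Prior from the text: 1,000 participants among 159 million individuals.
prior = 1000 / 159e6
# Illustrative operating point (assumed): 50% sensitivity, 5% false positives.
ppv = positive_predictive_value(0.5, 0.05, prior)
```

With a prior this small, almost all positive tests are false positives, which is why the positive predictive value collapses unless additional information narrows the candidate pool.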

Results from massively parallel sequencing (in the

form of low frequency or rare genetic variations) might

enable increased power of identification. If results from

multiple phenotypes are available, as would be the case

if, for example, gene expression associations were also

conducted (and accompanying results made available),

the information from each phenotype can be combined

to achieve much greater power as suggested by the results

from combining just two phenotypes. Although the

single-phenotype method has no power for individuals

with an average phenotype, it is unlikely a person will

have an average phenotype for all the phenotypes

considered.

A recent study16 of temporal trends in the availability of

results from GWAS classified published studies according

to level of risk for potential misuse and highlights the

ongoing importance of clearer guidelines on how ‘‘data

products’’ can be appropriately shared.

With the increasing trend to collect and analyze

multiple-omics data, the need to share large amounts of

quantitative GWAS results becomes more urgent. In addi-

tion, given our finding that multiple phenotypes can be

combined to increase the power to infer membership, pro-

tecting privacy by limiting the number of significant hits

published is becoming less feasible.

Because fluid sharing of results among researchers for

legitimate scientific use would be highly desirable, our

study emphasizes the urgent need to devise protocols

and methods that facilitate this process without compro-

mising a participant’s privacy.

One mechanism to address this problem would be to

implement an annual certification process, which would

grant the certified researcher unrestricted access to study

results with the condition that the data could only be

used for research goals that do not compromise the partic-

ipants’ privacy. A researcher who does not abide by these

rules could be penalized by withdrawing further access

to data.

Appendix A

Power Calculation

To compute power, we use the same assumptions as for the

conditional mean and variance, i.e., that the number of

markers is much larger than the number of individuals

in the test sample and the number of individuals in the

reference group: $M \gg n \gg 1$ and $M \gg n^{*} \gg 1$. Hardy-Weinberg equilibrium is assumed. Under these assumptions, it can be shown that $\hat{Y}$ converges to a normal variate with mean and variance given in Equation 2.

We define the null and alternative hypothesis as follows.

Under the null hypothesis, the individual did not partici-

pate in the study (nor did any relatives of the individual),

whereas under the alternative hypothesis, the individual

did participate.

If the method uses the sign of the difference $Y_I - \mu$, and we assume that the difference is greater than 0, we will reject the null hypothesis if $\hat{Y}_I$ is greater than $z_\alpha \sigma \sqrt{n/M}$, where $\alpha$ is the type I error and $z_\alpha$ is the $(1 - \alpha)$ quantile of the normal distribution. The power will be given by the probability under the alternative that $\hat{Y}_I > z_\alpha \sigma \sqrt{n/M}$:

$$\text{power} = P_{\text{in}}\left(\hat{Y}_I > z_\alpha \sigma \sqrt{\frac{n}{M}}\right) = 1 - \Phi\left(\frac{z_\alpha \sigma \sqrt{\frac{n}{M}} - (Y_I - \mu)}{\sigma \sqrt{\frac{n}{M}}}\right)$$ (Equation 8)

$$= 1 - \Phi\left(z_\alpha - \frac{Y_I - \mu}{\sigma} \sqrt{\frac{M}{n}}\right)$$ (Equation 9)

$$= \Phi\left(\frac{Y_I - \mu}{\sigma} \sqrt{\frac{M}{n}} - z_\alpha\right)$$ (Equation 10)

where in Equation (8) we have used the fact that $\hat{Y}_I$ is normally distributed with mean $Y_I - \mu$ and variance $\sigma^2 n/M$ and in Equation (10) we have used the property of the normal CDF $\Phi(x) = 1 - \Phi(-x)$.

If $Y_I - \mu < 0$, similar arguments will give

$$\text{power} = \Phi\left(\frac{-(Y_I - \mu)}{\sigma} \sqrt{\frac{M}{n}} - z_\alpha\right).$$

Thus more generally we have

$$\text{power} = \Phi\left(\frac{|Y_I - \mu|}{\sigma} \sqrt{\frac{M}{n}} - z_\alpha\right).$$ (Equation 11)

If the sign of the difference $Y_I - \mu$ is not used, the rejection region will be defined as $|\hat{Y}_I| > z_{\alpha/2} \sigma \sqrt{n/M}$. The alternative distribution will be an equally weighted mixture of normal distributions with means $|Y_I - \mu|$ and $-|Y_I - \mu|$. Note that any weight other than 1/2 would mean that we have information on whether it is more likely that the sign is positive or negative. For example, if we knew it was more likely to be positive, then we would give higher weight to the normal distribution with mean $|Y_I - \mu|$. The power when we do not make use of the sign of $Y_I - \mu$ is given by

$$\text{power} = P_{\text{in}}\left(|\hat{Y}_I| > z_{\alpha/2} \sigma \sqrt{\frac{n}{M}}\right) = P_{\text{in}}\left(\hat{Y}_I > z_{\alpha/2} \sigma \sqrt{\frac{n}{M}}\right) + P_{\text{in}}\left(\hat{Y}_I < -z_{\alpha/2} \sigma \sqrt{\frac{n}{M}}\right)$$ (Equation 12)

$$= \frac{1}{2}\left(1 - \Phi\left(\frac{z_{\alpha/2} \sigma \sqrt{\frac{n}{M}} - |Y_I - \mu|}{\sigma \sqrt{\frac{n}{M}}}\right)\right) + \frac{1}{2}\Phi\left(\frac{-z_{\alpha/2} \sigma \sqrt{\frac{n}{M}} + |Y_I - \mu|}{\sigma \sqrt{\frac{n}{M}}}\right)$$ (Equation 13)

$$= \frac{1}{2}\Phi\left(\frac{-z_{\alpha/2} \sigma \sqrt{\frac{n}{M}} + |Y_I - \mu|}{\sigma \sqrt{\frac{n}{M}}}\right) + \frac{1}{2}\Phi\left(\frac{-z_{\alpha/2} \sigma \sqrt{\frac{n}{M}} + |Y_I - \mu|}{\sigma \sqrt{\frac{n}{M}}}\right)$$ (Equation 14)

$$= \Phi\left(\frac{|Y_I - \mu|}{\sigma} \sqrt{\frac{M}{n}} - z_{\alpha/2}\right).$$ (Equation 15)
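Equations 11 and 15 are easy to evaluate numerically. The sketch below uses only the Python standard library; the function name and the `use_sign` flag are hypothetical, and the deviation is expressed in units of $\sigma$, that is, as $|Y_I - \mu|/\sigma$.

```python
import math
from statistics import NormalDist


def membership_power(deviation_in_sd, M, n, alpha=0.05, use_sign=True):
    """Approximate power from Equation 11 (sign used) or Equation 15 (sign not used).

    deviation_in_sd: |Y_I - mu| / sigma
    M: number of SNPs; n: sample size of the study; alpha: type I error
    """
    std_normal = NormalDist()
    # z_alpha when the sign is known, z_{alpha/2} when it is not.
    z = std_normal.inv_cdf(1 - alpha if use_sign else 1 - alpha / 2)
    return std_normal.cdf(abs(deviation_in_sd) * math.sqrt(M / n) - z)
```

For an individual exactly at the mean the one-sided expression reduces to $\Phi(-z_\alpha) = \alpha$, so the test has no power beyond its type I error, in line with the discussion of average phenotypes. The power then rises with both the deviation and the ratio $M/n$.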

Supplemental Data

Supplemental Data include two figures and can be found with this

article online at http://www.cell.com/AJHG/.

Acknowledgments

This work was supported by the Genotype-Tissue Expression

project (R01 MH090937) and the University of Chicago DRTC

(Diabetes Research and Training Center; P60 DK20595). The GoKinD study was conducted by the GoKinD investigators and supported by the Juvenile Diabetes Research Foundation, the Centers

for Disease Control, and the Special Statutory Funding Program for

Type 1 Diabetes Research administered by the National Institute of

Diabetes and Digestive and Kidney Diseases (NIDDK). This manu-

script was not prepared in collaboration with Investigators of the

GoKinD study and does not necessarily reflect the opinions or

views of the GoKinD study or the NIDDK.

Received: November 20, 2011

Revised: January 11, 2012

Accepted: February 8, 2012

Published online: March 29, 2012

References

1. Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W.,

Muehling, J., Pearson, J.V., Stephan, D.A., Nelson, S.F., and

Craig, D.W. (2008). Resolving individuals contributing trace

amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 4, e1000167.

2. Jacobs, K.B., Yeager, M., Wacholder, S., Craig, D., Kraft, P.,

Hunter, D.J., Paschal, J., Manolio, T.A., Tucker, M., Hoover,

R.N., et al. (2009). A new statistic and its power to infer

membership in a genome-wide association study using geno-

type frequencies. Nat. Genet. 41, 1253–1257.

3. Visscher, P.M., and Hill, W.G. (2009). The limits of individual

identification from sample allele frequencies: theory and

statistical analysis. PLoS Genet. 5, e1000628.

4. Sankararaman, S., Obozinski, G., Jordan, M.I., and Halperin, E.

(2009). Genomic privacy and limits of individual detection in

a pool. Nat. Genet. 41, 965–967.

5. Pluzhnikov, A., Below, J.E., Konkashbaev, A., Tikhomirov, A.,

Kistner-Griffin, E., Roe, C.A., Nicolae, D.L., and Cox, N.J.

(2010). Spoiling the whole bunch: quality control aimed at

preserving the integrity of high-throughput genotyping. Am.

J. Hum. Genet. 87, 123–128.

6. Manolio, T.A., Rodriguez, L.L., Brooks, L., Abecasis, G., Ballin-

ger, D., Daly, M., Donnelly, P., Faraone, S.V., Frazer, K., Gabriel,

S., et al; GAIN Collaborative Research Group; Collaborative

Association Study of Psoriasis; International Multi-Center

ADHD Genetics Project; Molecular Genetics of Schizophrenia

Collaboration; Bipolar Genome Study; Major Depression Stage

1 Genomewide Association in Population-Based Samples

Study; Genetics of Kidneys in Diabetes (GoKinD) Study.

(2007). New models of collaboration in genome-wide associa-

tion studies: the Genetic Association Information Network.

Nat. Genet. 39, 1045–1051.

7. International HapMap Consortium. (2003). The international

hapmap project. Nature 426, 789–796.

8. Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve, L.L.,

Gibbs, R.A., Belmont, J.W., Boudreau, A., Hardenbol, P., Leal,

S.M., et al; International HapMap Consortium. (2007). A

second generation human haplotype map of over 3.1 million

SNPs. Nature 449, 851–861.

9. 1000 Genomes Project Consortium. (2010). A map of human

genome variation from population-scale sequencing. Nature

467, 1061–1073.

10. Fisher, R. (1925). Statistical Methods for Research Workers,

Fifth Edition (Edinburgh: Oliver and Boyd).

11. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira,

M.A.R., Bender, D., Maller, J., Sklar, P., de Bakker, P.I.W.,

Daly, M.J., and Sham, P.C. (2007). PLINK: a tool set for

whole-genome association and population-based linkage

analyses. Am. J. Hum. Genet. 81, 559–575.

12. Sing, T., Sander, O., Beerenwinkel, N., and Lengauer, T. (2005).

ROCR: visualizing classifier performance in R. Bioinformatics

21, 3940–3941.

13. R Development Core Team. (2010). R: A Language and

Environment for Statistical Computing (Vienna: R Founda-

tion for Statistical Computing).

14. Mailman, M.D., Feolo, M., Jin, Y., Kimura, M., Tryka, K.,

Bagoutdinov, R., Hao, L., Kiang, A., Paschall, J., Phan, L.,

et al. (2007). The NCBI dbGaP database of genotypes and

phenotypes. Nat. Genet. 39, 1181–1186.

15. Lumley, T., and Rice, K. (2010). Potential for revealing indi-

vidual-level information in genome-wide association studies.

JAMA 303, 659–660.

16. Johnson, A.D., Leslie, R., and O’Donnell, C.J. (2011).

Temporal trends in results availability from genome-wide

association studies. PLoS Genet. 7, e1002269.


ARTICLE

Resolving the Breakpoints of the 17q21.31 Microdeletion Syndrome with Next-Generation Sequencing

Andy Itsara,1 Lisenka E.L.M. Vissers,2,3 Karyn Meltz Steinberg,1 Kevin J. Meyer,4 Michael C. Zody,5

David A. Koolen,2,3 Joep de Ligt,2,3 Edwin Cuppen,6,7 Carl Baker,1 Choli Lee,1 Tina A. Graves,8

Richard K. Wilson,8 Robert B. Jenkins,4 Joris A. Veltman,2,3 and Evan E. Eichler1,9,*

Recurrent deletions have been associated with numerous diseases and genomic disorders. Few, however, have been resolved at the molecular level because their breakpoints often occur in highly copy-number-polymorphic duplicated sequences. We present an approach that uses a combination of somatic cell hybrids, array comparative genomic hybridization, and the specificity of next-generation sequencing to determine breakpoints that occur within segmental duplications. Applying our technique to the 17q21.31 microdeletion syndrome, we used genome sequencing to determine copy-number-variant breakpoints in three deletion-bearing individuals with molecular resolution. For two cases, we observed breakpoints consistent with nonallelic homologous recombination involving only H2 chromosomal haplotypes, as expected. Molecular resolution revealed that the breakpoints occurred at different locations within a 145 kbp segment of >99% identity and disrupt KANSL1 (previously known as KIAA1267). In the remaining case, we found that unequal crossover occurred interchromosomally between the H1 and H2 haplotypes and that this event was mediated by a homologous sequence that was once again missing from the human reference. Interestingly, the breakpoints mapped preferentially to gaps in the current reference genome assembly, which we resolved in this study. Our method provides a strategy for the identification of breakpoints within complex regions of the genome harboring high-identity and copy-number-polymorphic segmental duplication. The approach should become particularly useful as high-quality alternate reference sequences become available and genome sequencing of individuals’ DNA becomes more routine.

Introduction

Structural variation, including copy-number variation,

accounts for a significant proportion of human genetic

diversity.1–4 A notable feature of copy-number variation is

the potential for recurrent events to occur at ‘‘hotspots’’

within the human genome as a result of nonallelic homologous recombination (NAHR) between repetitive sequences.

Most notable in this regard are segmental duplications

(SDs)—contiguous regions (>1 kbp) with high sequence

identity (>90%).5,6 Recurrent, de novo copy-number vari-

ants (CNVs) have been associated with a variety of pheno-

types, including schizophrenia (MIM 181500),7 autism

(MIM 209850),8 epilepsy (MIM 604827),9 intellectual

disability,10 congenital anomalies (MIM 612474 and

187500),11,12 severe obesity (MIM 613444),13 and renal

disease (MIM 137920).14

Although there have been significant advances in CNV

discovery and genotyping, precise breakpoint delineation

within SDs remains challenging. This information is,

however, essential if we are to further our fundamental

understanding of genome plasticity and processes under-

lying genomic rearrangements. Traditionally, breakpoint

resolution of genomic rearrangements required a combina-

tion of pulse-field gel electrophoresis and Southern blot

analysis to reveal an atypical hybridizing band that

harbored the breakpoint of interest.15,16 Sequence-level

breakpoint identification of the genome has advanced

considerably with more modern molecular methods that

leverage the high quality of the human reference

genome.17 For unique regions, the procedure is relatively

straightforward and typically includes array comparative

genomic hybridization (array CGH) followed by long-range

PCR,18 subcloning, and direct Sanger sequencing.19,20

More recently, next-generation methods have allowed

researchers to rapidly capture breakpoints by using split-

read21 and paired-end-read mapping approaches.19,20,22

In contrast, few breakpoints mapping to repetitive

regions, particularly those with large and highly identical

duplications (>10 kbp and >95%), have been cloned

and sequenced.16,23 Compared with those in unique regions, breakpoints

that map to repeated sequences are much more problematic.

Array CGH is unable to localize CNV breakpoints

within blocks of near-perfect sequence identity, which

may span hundreds of kilobases, because of probe cross-

hybridization. Long-range PCR is relatively ineffective

over such large distances of high sequence identity. Simi-

larly, paired-end-read or split-read approaches generally

fail to identify the breakpoints because of short library

inserts and short read lengths that cannot successfully

1Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA; 2Department of Human Genetics, Nijmegen Centre for Molecular

Life Sciences, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands; 3Institute for Genetic and Metabolic Disease, Radboud University

Nijmegen Medical Centre, Nijmegen, The Netherlands; 4Division of Laboratory Genetics, Department of Laboratory Medicine and Pathology, Mayo Clinic,

Rochester, MN 55905, USA; 5Broad Institute, Cambridge, MA 02142, USA; 6Hubrecht Institute, University Medical Center Utrecht, 3584 CT, The

Netherlands; 7Royal Netherlands Academy of Arts and Science, NL-1000 GC Amsterdam, The Netherlands; 8The Genome Institute at Washington University,

Washington University School of Medicine, St. Louis, MO 63108, USA; 9Howard Hughes Medical Institute

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.013. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 599–613, April 6, 2012 599

traverse the distances needed to anchor PCR primers to

unique identifiers on either side of the breakpoint. Break-

point resolution is further complicated by both structural

polymorphisms and gaps in the human genome reference

sequence, which often occur precisely at the breakpoints of

interest. Such differences make determination of the true

breakpoint particularly difficult because these sites harbor both

variation and sequence that are not present

in the human reference sequence.

Here, we present an approach for determining sequence-

level breakpoints occurring within SDs by using a

combination of somatic cell hybrids, array CGH, and

high-throughput sequencing. We take advantage of the

specificity of next-generation sequencing data and the

fact that large duplicated sequences with near-perfect

sequence identity will still carry hundreds of sequence

variants that distinguish the copies. A singly unique nucle-

otide (SUN) identifier is defined as a paralogous sequence

variant (PSV) that tags a specific sequence paralog by

uniquely distinguishing it from all other paralogs in the

human genome. Such variants allow for interrogation of

individual paralogs that are otherwise difficult to distin-

guish. In practice, SUNs are identified from next-genera-

tion sequencing data with SUN k-mers (SUNKs), sequences

that have length k and map to exactly one genomic

location containing one or more SUNs. Previously, we

developed a catalog of these variants,24 and here we apply

them to define breakpoints in individuals. We examine

recurrent microdeletions on 17q21.31, one of the most

structurally complex regions of the genome, as a model

locus. Structural variation at this locus has been exten-

sively characterized, most notably in haplotype-specific

sequence assemblies of the H1 and H2 haplotypes, making

the locus ideal for further study.25,26
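
The SUNK idea described above can be illustrated with a toy example. This is only a sketch under simplifying assumptions: the two short "paralogs" are invented sequences, the whole "genome" consists of just these two strings, and exact string counting stands in for read mapping. The function names are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter

def kmers(seq, k):
    """Yield (start, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]

def find_sunks(paralogs, k):
    """Return, per paralog, the k-mers that occur exactly once across all paralogs.

    Counting across the toy paralogs is a stand-in for 'maps to exactly one
    genomic location'.
    """
    counts = Counter(kmer for seq in paralogs.values() for _, kmer in kmers(seq, k))
    return {
        name: [(i, kmer) for i, kmer in kmers(seq, k) if counts[kmer] == 1]
        for name, seq in paralogs.items()
    }

# Two hypothetical paralogs that differ at one position (index 6: G versus T).
toy = {
    "proximal": "ACGTACGTTAGC",
    "distal":   "ACGTACTTTAGC",
}
sunks = find_sunks(toy, k=5)
for name, hits in sunks.items():
    print(name, hits)
```

In this toy case, every unique 5-mer overlaps the single differing base, which is the property that lets a short read tag one paralog rather than the other.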

Material and Methods

H2 Reference Assembly

Analysis of the H1 and H2 haplotypes was based on previously

reported haplotype-specific sequence assemblies.26

Generation of Somatic Cell Hybrids

Somatic cell hybrids were generated at Mayo Medical Laboratories.

After electrofusion of Epstein-Barr Virus (EBV) cells with E2 cells,

mouse-human hybrid colonies were observed at 18 days. Subse-

quently, 88 clones were selected for initial expansion and genotyp-

ing. Six A and six B chromosome 17 homologs were selected for

additional subculture. At pass three, all 12 hybrid clones were

tested for chromosome 17 by FISH. On the basis of the FISH

results, two A and two B hybrid clones were selected for confirma-

tory genotyping, and all cases confirmed retention of the appro-

priate A or B genotype. This study was approved by the institu-

tional review board of the University of Washington and

Radboud University, and all subjects provided informed consent.

Sample Genotyping

As previously described,27 H1/H2 genotyping was determined via

gel electrophoresis on the basis of a deletion in intron 9 of MAPT.

After generation of somatic cell hybrids, initial confirmatory

genotyping was performed at Mayo Medical Laboratories

(AFMa061za9, AFM192yh2, AFMa154za9, and AFM044xg3). Addi-

tional markers, AFM298wg5, AFMb364yh9, AFM155xd12, and

AFMa110wb5, were identified as being close to the 17q21.31 dele-

tion on the basis of the Marshfield genetic map,28 and these were

subsequently genotyped at the University of Washington with the

primers specified in the UniSTS marker database. To examine

microsatellites within SDs, we chose a subset of the reported

markers and primers used in a previously reported BAC assembly

of the H2 haplotype.25 After amplification, all microsatellite geno-

types were determined with an ABI 3730 DNA analyzer. All

primers used are listed in Table S1.

Haplotype-Specific Array CGH

By using hybrid cell line DNA, we performed array CGH to

compare the H1, H2, and 17q21.31 deletion-bearing chromo-

somes to one another. Because the hybrid cell lines are haploid

for human chromosome 17, unique regions of the human genome

removed by deletion have an extremely low signal, corresponding

to copy number 0. In contrast, deletions within SDs, regions of the

genome for which there exist additional paralogs, display interme-

diate levels of signal loss proportional to the number of paralogous

copies elsewhere in the genome. Although a mouse genome is

present in hybrid cells, we expected minimal cross-hybridization

because even single mismatches are known to affect probe hybrid-

ization,29–31 and at exons within 17q21.31, the average human-

mouse identity is ~85%, corresponding to nine mismatches on

a 60 bp probe.32 Finally, we visualized array CGH data on the H2

haplotype by remapping probes.26
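
As a quick check of the figure quoted above, the expected number of mismatches on a probe is the probe length multiplied by the fraction of non-identical bases. This assumes that mismatches are spread uniformly along the probe, which is a simplification:

```latex
\[
  60\ \text{bp} \times (1 - 0.85) = 9\ \text{mismatches}
\]
```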

Array Design and Analysis

We designed a custom 244K Agilent array specifically to interro-

gate 17q21.31 contained within hybrid cell lines (Table S2; GEO

accession code GSE34867). At the deletion locus and flanking

sequence (NCBI build 36, chr17:40.25M–42.75M), probes were

placed at high density at 1 probe per 100 bp. Sample labeling

was achieved with Roche NimbleGen Dual-Color DNA Labeling

kits according to the manufacturer’s protocol, but half (500 ng)

the input DNA was used, and the protocol was scaled appropri-

ately. For array hybridization, 25 ng each of labeled test and

reference DNA was then brought to a 158 μl volume. Subsequently,

the labeled DNA was hybridized to a custom Agilent

array according to the Agilent hybridization protocol. In brief,

the recommended hybridization master mix for a 1× microarray

was prepared and added to the labeled DNA, and hybridization

at 65°C on a rotator rack (20 rpm) followed for 72 hr. Array

wash and scanning proceeded according to the manufacturer’s

protocol. However, feature extraction was carried out with a

normalization set consisting of probes on human chromosome

17 but outside of 17q21.31.

Array CGH oligonucleotide probes were remapped to the H2

assembly with BLAST (blastn parameters -e 1e-10 -m 8 -W 7).33

Partial BLAST hits were extended without gaps to encompass

the entire probe sequence, and probes with no BLAST hits were

aligned with JAligner (see Web Resources), an implementation of

the Smith-Waterman algorithm (NUC.4.4 matrix; gap open and

extension penalties were equal to 10). Finally, probes were mapped

to a given location on the H2 assembly if and only if the global

alignment mapped with a ≤1 bp mismatch and a ≤1 bp gap.

Using these criteria, we mapped 11,967 distinct probes to 18,914

positions in the H2 assembly. To calculate the haploid copy

number of probes mapping to the H2 assembly, we aligned each

probe to the human genome (build 36), mouse genome (mm8),

and the H2 assembly by using BLAST (with the same parameters

as those used in probe mapping). To avoid double-counting

between the human genome and the H2 assembly, we excluded

human genome BLAST hits to the 17q21 deletion region (chr17:

40799295–42204344). To provide a ceiling on the copy number

of a given probe, we defined a probe’s copy number as the number

of BLAST hits covering ≥90% of the probe with ≤3 mismatches

and ≤1 bp gap. Consistent with a tendency to overestimate probe

copy number, for the 3,231 probes that were within the H2

assembly between 700,000–1,000,000 bp, a region predicted to

be almost entirely unique sequence in a haploid human genome,

99% (3,186/3,231) of probes were predicted to have a copy

number of 1, and the remaining probes were predicted to have

a copy number >1.
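
The copy-number ceiling described above can be sketched as a filter over tabular BLAST output. This is an illustration, not the authors' script. It assumes the standard 12-column tabular layout produced by `-m 8` (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score). It also approximates the "≤1 bp gap" criterion with the gap-openings column, which counts gap openings rather than gapped bases. The function name and the example hits are hypothetical.

```python
def probe_copy_number(blast_m8_lines, probe_len,
                      min_cov=0.90, max_mismatch=3, max_gap_open=1):
    """Count tabular BLAST hits that pass the coverage, mismatch, and gap cutoffs."""
    n = 0
    for line in blast_m8_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        f = line.rstrip("\n").split("\t")
        mismatches, gap_opens = int(f[4]), int(f[5])
        q_start, q_end = int(f[6]), int(f[7])
        # Fraction of the probe covered by this alignment.
        coverage = (abs(q_end - q_start) + 1) / probe_len
        if (coverage >= min_cov and mismatches <= max_mismatch
                and gap_opens <= max_gap_open):
            n += 1
    return n

# Invented hits for one 60 bp probe.
hits = [
    "\t".join(["probe1", "H2_contig", "100.00", "60", "0", "0", "1", "60", "1000", "1059", "1e-28", "119"]),
    "\t".join(["probe1", "H2_contig", "96.67", "60", "2", "0", "1", "60", "5000", "5059", "1e-22", "103"]),
    "\t".join(["probe1", "chr4", "100.00", "40", "0", "0", "1", "40", "700", "739", "1e-15", "80"]),   # too short
    "\t".join(["probe1", "chr2", "91.67", "60", "5", "0", "1", "60", "900", "959", "1e-12", "72"]),    # too many mismatches
]
n = probe_copy_number(hits, probe_len=60)
print(n)
```

Because near-matches are counted as copies, a filter of this kind errs toward overestimating copy number, consistent with the "ceiling" wording in the text.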

We determined copy-number loss at each probe given NAHR

between a particular pair of paralogous sequences. The expected

relative copy number for a given probe was defined as the copy

number of a probe after the deletion divided by the estimated

probe copy number in the H2 assembly. We compared expected

changes in relative copy number to observed log2 ratios to deter-

mine the most likely pair of paralogous sequences mediating

each deletion (Figure S1B).
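
The expected relative copy number defined above can be written out as a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, and the floor applied before taking the logarithm is an illustrative choice, since a measured array signal never reaches a true zero.

```python
import math

def expected_relative_copy_number(copies_before, copies_deleted):
    """Copy number remaining after the deletion divided by copy number before it."""
    if copies_before <= 0:
        raise ValueError("copies_before must be positive")
    if not 0 <= copies_deleted <= copies_before:
        raise ValueError("copies_deleted must lie between 0 and copies_before")
    return (copies_before - copies_deleted) / copies_before

def expected_log2_ratio(relative_cn, floor=2 ** -6):
    """Expected log2 ratio; the floor (an assumption) avoids log2(0)."""
    return math.log2(max(relative_cn, floor))

# Unique sequence on a haploid chromosome: 1 copy, 1 deleted.
unique = expected_relative_copy_number(1, 1)
# Duplicated sequence with one paralog elsewhere: 2 copies, 1 deleted.
duplicated = expected_relative_copy_number(2, 1)
print(unique, expected_log2_ratio(unique))
print(duplicated, expected_log2_ratio(duplicated))
```

For a probe with two copies of which one is deleted, the expected log2 ratio is -1, which is the intermediate signal loss the text describes for deletions within SDs.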

Gap Closure

To close gap 2, we used the previously identified BAC RP11-84A7

(AC243906). To close gap 1, we screened for clones mapping to

gap regions by using a method similar to that previously reported

for placing fosmids in the genome.20 We locally aligned fosmid

end sequences to the H1 assembly and H2 pseudo-assembly by

using MegaBLAST.34 Clones under consideration were subse-

quently limited to those with an alignment either within the

spacer sequence (represented in AC217768) or at the proximal

end of AC139677. Local alignments were then extended into

global alignments with needle, a Needleman-Wunsch algorithm

implementation from the EMBOSS software suite.35 We scored

global alignments for mismatches and gaps by only using bases

with Q30 or higher quality. Paired end-sequence placements

were then screened on the basis of concordant clone-end orienta-

tion and estimated insert size. Subsequently, clone-end orienta-

tion and size-concordant placements were assigned to the H1

haplotype, other paralogous sequence in the H2 haplotype, or

sequence that mapped adjacent to or within the proximal gap;

sequence identity was used as a tie-breaker. Importantly, for all

clones chosen, end sequences were best assigned to sequence adja-

cent to the gap or inferred sequence within the gap and not at

paralogous sequence elsewhere in the H1 or H2 assemblies. We

selected three clones for sequencing: two clones extending proxi-

mally and distally from the spacer sequence on AC217768

(1134622_I19 and 50932900_K17; AC244164 and AC244161,

respectively) and one clone (1013914_P2; AC244163) extending

proximally from the proximal end of AC139677 (Figure S2). The

three fosmids and the BAC clone used for closing gaps in the H2

haplotype were sequenced and assembled at The Genome Insti-

tute at Washington University. Consistent with our hypothesized

structure for RP11-374-N3, distal portions of 50932900_K17

(AC244161) and proximal portions of 1013914_P2 (AC244163),

which mapped to gap 1, were paralogous and in direct orientation

to SDs on the H1 and H2 haplotypes proximal to unique deleted

sequence (Figure S2, Figure S3, and Figure S4). Similarly,

1134622_I19 (AC244164) mapped entirely to finished sequence

(all from AC217768; Figure S5) in the H2 assembly and contained

sequence that was paralogous, but of inverted orientation (based

on end-sequence placement), to SDs on the H1 and H2 haplotypes

proximal to unique deleted sequence.
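
The screen for concordant clone-end orientation and insert size can be sketched as below. The insert-size window is an assumed range for a roughly 40 kbp fosmid insert and is not a value given in the text. The `EndPlacement` record and `is_concordant` function are hypothetical names for illustration.

```python
from dataclasses import dataclass

@dataclass
class EndPlacement:
    contig: str   # sequence the end read aligned to
    start: int    # leftmost aligned base on the contig
    end: int      # rightmost aligned base on the contig
    strand: str   # "+" or "-"

def is_concordant(a, b, min_insert=32_000, max_insert=48_000):
    """True if two clone ends face each other on one contig and span a plausible insert."""
    if a.contig != b.contig:
        return False
    left, right = (a, b) if a.start <= b.start else (b, a)
    # The left end must read forward and the right end must read back toward it.
    if (left.strand, right.strand) != ("+", "-"):
        return False
    span = right.end - left.start + 1
    return min_insert <= span <= max_insert

fwd = EndPlacement("H2", 100_000, 100_700, "+")
rev = EndPlacement("H2", 139_300, 140_000, "-")
print(is_concordant(fwd, rev))
```

Placements that pass such a screen would still need to be compared across paralogous locations, for which the text uses sequence identity as the tie-breaker.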

Next-Generation Sequencing, Complete Genome Sequencing, and Breakpoint Mapping with SUNs

Massively parallel sequence data were generated from three

probands with both SOLiD and Illumina sequencing platforms.

For members of family 2, long mate-paired libraries were generated

from 100 μg of genomic DNA, which was isolated from peripheral

blood samples via QIAamp mini columns (QIAGEN). Library preparation

was essentially as described in the SOLiDv3.5 library preparation

manual (Applied Biosystems). Of note, we performed DNA

size selections directly after CAP adaptor ligation to select genomic

fragments between 2 and 3 kbp and, moreover, to reduce the pres-

ence of concatamers. Additionally, we performed a size selection

after library amplification. To assess the presence of adaptors and

determine the average insert sizes, we cloned libraries and chose

384 clones per library for capillary sequencing. Initially, we

sequenced two 50 bp mates for each library (F3 and R3 tags) on

a SOLiD 3PLUS instrument, using a single quadrant of the

sequencing slide each for the father and the mother but two quadrants

for the proband. To obtain additional read depth for the

mother and proband, we subsequently performed a 50-bp-frag-

ment run on the same libraries by using a full sequencing slide

for each on a SOLiD4 instrument.

For the family 1 proband (31928) and family 3 proband (31873),

3 μg of genomic DNA was sheared, end-repaired, and A-tailed,

and adaptors were ligated to the fragments as described in Igartua

et al.36 After ligation, the samples were run on a 6% pre-cast

polyacrylamide gel (Invitrogen, catalog number EC6265BOX).

The band at 400 bp was excised, diced, and incubated. Size-selected

fragments were amplified with 0.5 μl of primers, 25 μl of

2× iProof, 0.25 μl of SYBR Green, and 8.25 μl of dH2O under the

following conditions: 98°C for 30 s, 30 cycles of 98°C for 10 s,

60°C for 30 s, 72°C for 30 s, 72°C for 15 s, and 72°C for 2 min.

Fluorescence was assessed between the 30 and 15 s 72°C steps.

Amplified, size-selected libraries were quantified with an Agilent

2100 Bioanalyzer and paired-end sequenced (101 bp reads) on

an Illumina HiSeq 2000.

Using a pipeline similar to that previously described,24 we identi-

fied 36-mer SUNKs that uniquely distinguish paralogs potentially

mediating 17q21.31 deletions in the H2 assembly. We identified

PSVs by one of two methods: First, for sequence present in the

current assembly, we used whole-genome assembly comparison

(WGAC)-defined global alignments to identify single-base-pair

differences between paralogs (Figure S6). Second, for sequence in

the proximal gap, we identified and sequenced fosmids

(AC244161 and AC244163) extending into either side of the gap.

We subsequently identified PSVs from alignment of fosmid draft

sequences against inferred regions of paralogy on the H1 and H2

haplotypes (H1:219,599–261,693 and H2:452,165–261,693,

respectively) by using stretcher, a Needleman-Wunsch algorithm

implementation from the EMBOSS software suite (Figure S6).35

For each identified PSV, we generated all possible 36-mers incor-

porating the variant. Subsequently, we passed the 36-mers

through a series of filters. First, those containing repeat sequence

as identified by RepeatMasker and TandemRepeatFinder37 or those

within 36 bp of such sequence were excluded. Second, we used

mrFAST38 to identify all possible mappings, including that to the

H2 haplotype (GRCh37), of each 36-mer to the mouse (mm8)

and human reference assembly, allowing for up to two mismatches,

insertions, or deletions (edit distance ≤2). For PSVs

outside gap 1, we identified SUNKs as those reads with one exact

match in the human reference assembly or the H2 haplotype,

no exact matches to the mouse genome, ≤10 mrFAST hits with

edit distance ≤2 in the human genome, and ≤10 mrFAST hits

with edit distance ≤2 in the mouse genome. SUNKs within gap

1 were defined similarly, but no matches to the current reference

assembly or H2 haplotype were allowed.
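
The k-mer generation and filtering steps above can be sketched as follows. This is a minimal illustration with hypothetical function names. It takes precomputed hit counts as input rather than calling mrFAST, and it leaves open whether the near-match counts include exact matches, because the text does not say.

```python
def kmers_covering(seq, pos, k=36):
    """Return every k-mer of seq that includes the 0-based position pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(first, last + 1)]

def is_sunk(human_exact, mouse_exact, human_near, mouse_near,
            in_gap1=False, max_near=10):
    """Apply the thresholds quoted in the text to one candidate k-mer.

    human_exact: exact matches in the human reference or H2 haplotype
    mouse_exact: exact matches in the mouse genome
    human_near, mouse_near: hits with edit distance <= 2
    in_gap1: gap 1 k-mers must have no match in the reference or H2 haplotype
    """
    required_exact = 0 if in_gap1 else 1
    return (human_exact == required_exact
            and mouse_exact == 0
            and human_near <= max_near
            and mouse_near <= max_near)

# A PSV in the middle of a 100 bp toy sequence yields 36 overlapping 36-mers.
toy_seq = "ACGT" * 25
candidates = kmers_covering(toy_seq, pos=50)
print(len(candidates))
print(is_sunk(human_exact=1, mouse_exact=0, human_near=3, mouse_near=0))
```

A PSV far from the sequence ends always yields k candidate k-mers, so each variant gets up to 36 chances to produce a usable 36-mer SUNK.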

Because of high sequence identity within AC217768 in the

current H2 assembly, relatively few SUNs were identified in

gap 1. However, because all sequence in AC217768 is lost in

NAHR-mediated 17q21.31 deletions, gap 1 PSVs that are only

present elsewhere in the genome within AC217768 are still break-

point-informative for H2/H2 NAHR. Similarly, gap 1 PSVs that are

only present on the H2 haplotype proximal to or within

AC217768 are breakpoint-informative for H1/H2 NAHR. Using

these criteria, we identified additional H1/H2 or H2/H2 break-

point-informative PSVs.

Finally, we empirically validated the presence or absence of SUNs

by using data from the 1000 Genomes Project.39 As a positive

control, we identified candidate SUNKs in the combined sequence

data from nine H1/H2 CEU (Utah residents with ancestry from

northern and western Europe from the CEPH collection) individuals

(mean coverage 3×), and H2-specific candidate SUNs without

observed mapped reads were excluded. As a negative control, we

identified candidate SUNKs in combined sequence data from

a CEU trio (mean coverage 27.6×; NA12878, NA12891, and

NA12892) and from a YRI trio (mean coverage 21×; NA19238,

NA19239, and NA19240), all with H1/H1 genotypes. H2-specific

candidate SUNs were discarded if observed at a read depth above

the minimum H1-specific SUN read depth in two or more samples.

A similar validation procedure was carried out for H1-specific SUNs.

We used next-generation sequencing data from probands to refine

the breakpoints of the rearrangement on the basis of the absence

or presence of reads mapping to these unique identifiers.
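
The positive and negative controls used to validate H2-specific candidate SUNs can be expressed as a simple decision rule. This sketch reflects one reading of the text and is not the authors' code. In particular, it assumes that "two or more samples" refers to individual H1/H1 data sets, each with its own minimum H1-specific SUN read depth. The function name and the example depths are hypothetical.

```python
def validate_h2_sun(depth_in_h1h2_pool, depths_in_h1h1_samples, min_h1_sun_depths):
    """Return True if an H2-specific candidate SUN passes both controls.

    depth_in_h1h2_pool: reads at the candidate in pooled H1/H2 data (positive control)
    depths_in_h1h1_samples: reads at the candidate in each H1/H1 sample (negative control)
    min_h1_sun_depths: minimum read depth over H1-specific SUNs in each H1/H1 sample
    """
    # Positive control: the candidate must be seen where an H2 haplotype is present.
    if depth_in_h1h2_pool == 0:
        return False
    # Negative control: discard if it looks present in two or more H1/H1 samples.
    n_over = sum(d > m for d, m in zip(depths_in_h1h1_samples, min_h1_sun_depths))
    return n_over < 2

print(validate_h2_sun(5, [0, 0], [4, 3]))   # seen in H1/H2 pool, absent in H1/H1
print(validate_h2_sun(5, [6, 5], [4, 3]))   # too deep in both H1/H1 samples
```

Using the minimum H1-specific SUN depth as the cutoff scales the tolerance for stray reads to the coverage of each sample.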

Results

We briefly review the structural features of the 17q21.31

microdeletion locus. Within the current reference

assembly (GRCh37), the locus is defined approximately

by chr17:43.4–44.8 Mbp. The locus encompasses ~600 kbp

of unique sequence. This sequence contains several genes,

including MAPT, CRHR1, and KANSL1 (previously known

as KIAA1267), and is flanked by extensive SDs. The

17q21.31 locus has two major structural haplotypes spanning

~1.5 Mbp: the H1 haplotype, which is most common,

and the H2 haplotype, which is present at a frequency of

20% in Europeans.25,27,40 BAC-based, haplotype-specific

sequence assemblies of the H1 and H2 haplotypes have

previously been created from the BAC library RP11, which

was derived from an H1/H2 individual.26 The reference

assembly at 17q21.31 represents the H1 haplotype, and

the H2 is presented as an alternate haplotype

(chr17_ctg5_hap1). These two haplotypes are distin-

guished by the presence of an approximately 970 kbp

inversion in addition to more than 300 kbp of differences

in the copy number and content of SDs (Figure S7).25,26

Importantly, the H2 haplotype contains 95 kbp of SD in

direct orientation flanking the unique region, whereas no

such sequence is observed in the H1 haplotype. Recurrent

deletions at this locus cause the 17q21.31 microdeletion

syndrome (MIM 610433), in which deletions only arise

in parents with one or more H2-bearing chromo-

somes.41–43 NAHR involving only this H2-specific duplica-

tion is hypothesized to underlie the H2 predisposition to

microdeletion.26

Our goal was to localize the breakpoints of recurrent

17q21.31 deletions in six individuals of European descent.

This set included three families wherein de novo microde-

letions had been previously identified41 and for which

transformed cell lines had been constructed from the

proband and both parents, as well as three unrelated

probands with the 17q21.31 deletion, for further anal-

ysis.9 To assess the accuracy of our experiments, we pro-

ceeded in a series of steps whereby we developed genomic

resources to simplify and validate our findings as needed.

To remove the potential confounding effects of large-scale

differences on different structural haplotypes on chromo-

some 17, we initially isolated deletion-bearing chromo-

somes by using somatic cell hybrids (reviewed in Trask

et al.44) from both the transmitting parent and the

proband (Figure 1). This allowed us to design the ideal

array CGH experiment, where duplicated sequences flank-

ing the critical region could be compared in the isolated

donor and deleted chromosomes (Figure 1B, Figure 2).

Once we refined the location of the paralogous segments

where breakpoints were likely to occur, we focused on

obtaining sequence-level breakpoint resolution in the

three probands with parental information. It then became

necessary to discover and characterize sequence that map-

ped to gaps within the H2 haplotype; the additional

sequence allowed us to attain sequence-level breakpoint

delineation by using a combination of next-generation

sequencing and SUN identifiers.24 This breakpoint delinea-

tion was consistent with results obtained by array CGH of

somatic cell hybrids. These results give us confidence that

genome sequencing of individuals in conjunction with

SUN mapping will provide a robust method for routine

breakpoint characterization in the future.
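
The logic of SUN-based breakpoint delineation can be sketched with a toy model of unequal crossover between a proximal and a distal duplicon that share one alignment coordinate system. Under this model, the deletion chromosome keeps proximal-specific variants up to the crossover and distal-specific variants beyond it. The function name, the input layout, and the `trust_absence` switch (which treats missing reads as evidence only at high coverage) are assumptions for illustration. The sketch does not check for conflicting observations.

```python
def crossover_interval(suns, min_reads=1, trust_absence=False):
    """Bound the crossover position from presence or absence of paralog-specific variants.

    suns: iterable of (aligned_pos, paralog, read_count),
          with paralog either "proximal" or "distal"
    Returns (lower, upper); a bound is None if no variant constrains it.
    """
    lower, upper = None, None
    for pos, paralog, reads in suns:
        present = reads >= min_reads
        if not present and not trust_absence:
            continue  # at low coverage, missing reads are uninformative
        # A retained proximal variant, or a lost distal variant, lies before the crossover.
        before_crossover = present if paralog == "proximal" else not present
        if before_crossover:
            lower = pos if lower is None else max(lower, pos)
        else:
            upper = pos if upper is None else min(upper, pos)
    return lower, upper

# Invented read counts at paralog-specific variants along the alignment.
toy_suns = [
    (100, "proximal", 12), (300, "proximal", 9), (500, "proximal", 0),
    (200, "distal", 0), (400, "distal", 0), (700, "distal", 15), (900, "distal", 8),
]
low_cov = crossover_interval(toy_suns)
high_cov = crossover_interval(toy_suns, trust_absence=True)
print(low_cov, high_cov)
```

In the toy data, trusting absent variants narrows the interval from (300, 700) to (400, 500), mirroring the refinement at high coverage described in the Figure 1C legend.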

Somatic Cell Hybrid Characterization

We constructed 36 somatic cell hybrids derived from three

parent-child trios in which the child harbored a de novo

17q21.31 deletion and from three unrelated 17q21.31-

deletion-bearing probands for whom no parental DNA

samples were available (Figure 1; Table S3). H1/H2 haplo-

type status was determined with a previously described

238 bp deletion marker within intron 9 of MAPT.27 In all

three cases for which parental DNA samples were available,

one parent was either homozygous or heterozygous for the

H2 haplotype, and the other was homozygous for the H1

haplotype. For each of the six probands and the parents

containing an H2 haplotype, we constructed at least two

human-mouse somatic cell hybrid cell lines such that

each of the chromosome 17 homologs (referred to as A

and B; see Material and Methods) was isolated. The crea-

tion of somatic cell hybrids isolates the 17q21.31 dele-

tion-bearing chromosome and the progenitor parental

chromosome prior to deletion and thereby facilitates

breakpoint detection (Figure 1).

We initially genotyped the somatic cell hybrids by using

eight microsatellite markers (Figure S8 and Table S3) to

assess the integrity of each chromosome 17 homolog and

confirm that deletions originated from the parent carrying

the H2 haplotype. In family 1, markers immediately flank-

ing the deletion locus in the proband (31928) indicate that

it probably arose as a result of interchromatidal NAHR

(between sister chromatids), as expected. In family 2, the

deletion occurred in the gamete of the mother (31918),

who is homozygous for the H2 haplotype, which is also

suggestive of interchromatidal NAHR. Finally, in family

3, crossover between the H1 and H2 haplotypes and the

17q21.31 deletion co-occur within a genetic distance of

less than 0.54–1.32 cM, as determined by the Marshfield

map and HapMap, respectively.28,45 Because of the short

genetic distance separating the events, these preliminary

results suggested the possibility that unequal crossover

between the H1 and H2 haplotypes generated the deletion

within this family. We tested an additional seven microsa-

tellite markers flanking the deletion locus (Table S3).

The results remained consistent with interchromosomal

but not intrachromatidal NAHR for family 3 (Figures S8

and S9).

Haplotype-Specific Array CGH

We next performed haplotype-specific array CGH by using

matched chromosome 17 hybrid cell lines (Figure 1B;

Material and Methods; GEO accession code GSE34867).

For each family, we hybridized DNA from a line containing

the 17q21.31-deletion-bearing chromosome of the child

against the corresponding H2-haplotype-bearing hybrid

cell line from the parent. As expected, deletions within

the unique portion of 17q21.31 were readily apparent

(relative copy number 0; Figure 2). Deletions within the

SDs were detectable but displayed intermediate levels of

signal loss proportional to the number of paralogous

copies elsewhere on chromosome 17. We observed similar

patterns and log2 ratio signal intensity for both families 1

[Figure 1 schematic. Panel A: electrofusion yields human/mouse hybrid cells; loss of either the unaffected or the 17q21-deletion chromosome 17 leaves an isolated homolog on a mouse-genome background for phased marker genotyping, haplotype-specific array CGH, and haplotype-specific high-throughput sequencing. Panel B: log2 ratio versus genomic position for the 17q21 deletion (test) against the unaffected H2 chromosome (reference), with relative copy number 0 and 0.5 marked. Panel C: locus-specific PSVs, inferred region of crossover (including at high coverage), and maximum extent of deletion.]

Figure 1. Schematic of SD-Breakpoint Detection Approach

(A) After the creation of human/mouse hybrid cells, clonal populations that carried only one of two chromosome 17 homologs were selected. The 17q21.31 deletion-bearing chromosome could then be studied in isolation from the unaffected chromosome 17.

(B) Hybrid cell lines permit haplotype-specific array CGH. NAHR-mediated deletions (bottom schematic, gray box) remove both unique sequence and SD (block arrows). Deletions in unique sequence are seen as extremely low signal representing relative copy number 0 (log2 ratio plot schematic). Copy-number loss in SD displays intermediate signal loss proportional to the number of remaining paralogous copies elsewhere in the genome (in the schematic, relative copy number = 0.5).

(C) For NAHR-mediated deletions, unequal crossover within SDs (rectangles) removes PSVs specific to the proximal and distal duplicons (vertical hashes in upper and lower rectangle halves, respectively), which can be used to infer the maximal extent of the deletion and the region of crossover. At low coverage, the absence of reads mapping to a PSV might reflect lack of sequence coverage. At sufficiently high coverage, however, the absence of reads mapping to a PSV (gray vertical hashes) implies the absence of the PSV in the sample and can further refine the crossover region.

[Figure 2 plots. Log2 ratio (−6 to 2) versus H2 contig position (0 to 1,500,000 bp) for families 1, 2, and 3, with assembly gaps, segmental duplications, and potential NAHR breakpoints A–D marked.]

Figure 2. Haplotype-Specific Comparative Genomic Hybridization of Three 17q21.31 Deletion-Bearing Chromosomes versus an Unaffected H2 Chromosome 17

(A) Somatic cell hybrid DNA allowed for array CGH comparing specific 17q21 haplotypes. Relative gain (black), loss (gray) and gains and losses >3 standard deviations beyond the chromosome 17 mean (green and red, respectively) are plotted against genomic position on a previously described sequence assembly of the H2 haplotype.26

(B) Pairs of segmental duplications (SDs) in direct orientation as determined by sequence comparison6 are shown as pairs of colored blocks. If we assume that the deletions occurred due to NAHR, there are four pairs of directly oriented SDs that can mediate the rearrangement (breakpoints A–D). The percent identity between SDs is 98.6%, 99.2%, 99.3%, and 99.7% for breakpoints A, B, C, and D, respectively. Because chromosome 17 homologs are initially haploid within somatic cell hybrids, deletions within unique regions of the genome (family 2, yellow highlight) are seen as an extremely low signal corresponding to relative copy number 0. In contrast, deletions within SDs display intermediate levels of signal loss as a result of cross-hybridization from paralogous sequence elsewhere in the genome. The light blue highlights in family 2 (A) represent a deletion that occurred within SDs (not shown) and that resulted in a relative loss of signal at both locations, potentially confounding breakpoint analysis.

and 2, whereas the deletion in family 3 showed a different

pattern by array CGH. We noted, for example, that some

signal loss proximal to 340 kbp and distal to 1.38 Mbp

was not observed in the other individuals (Figure 2;

Figure S10).

We hypothesized that the array CGH signature observed

in family 3 was a consequence of interchromosomal NAHR

and sought to assess its relative frequency in 17q21.31-

deletion-bearing probands. Further examination of the

three additional unrelated 17q21.31-deletion-bearing

probands by array CGH showed log2 ratios similar to those

in families 1 and 2 (Figure S11). The breakpoints for these

three additional individuals had been previously analyzed

by array CGH of diploid DNA42 and provided a benchmark

for comparison. We also surveyed 12 additional 17q21.31

spontaneous deletions by using a combination of a

lower-resolution array CGH platform and marker segrega-

tion and noted only one further case, which was consistent

with the H1/H2 recombination pattern identified in family

3. Thus, on the basis of our analysis with somatic cell

hybrids (1/6) and examination of other data (1/12), H1/H2

deletions account for ~10% of cases.
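
One way to arrive at the ~10% figure is to pool the two observations. Pooling is an assumption here, since the text does not say how the two counts were combined, and with so few events the uncertainty is large:

```latex
\[
  \frac{1 + 1}{6 + 12} = \frac{2}{18} \approx 0.11
\]
```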

Under the assumption that the 17q21.31 deletions arose

as a result of NAHR between high-identity SDs, we devel-

oped a breakpoint analysis method that compares the

array CGH signal intensity to the expected changes in

relative copy number of high-identity SDs bracketing the

critical region (see Material and Methods). Analysis of the

H2 assembly predicted four possible pairs of paralogous

sequences (breakpoint regions A–D; Figure 2; Figures S7

and S10) under a model of H2 interchromatidal NAHR.

Examining SDs at the proximal deletion breakpoint, we

observed a predicted region of copy number 0 (yellow

highlight, Figures S10A and S10C) for breakpoints A–C.

Although array CGH data from family 3 demonstrated

a log2 signal consistent with a copy number of 0 in this

region, the same degree of signal loss was not observed

in either family 1 or family 2. This suggests that deletions

for both family 1 and family 2 are mediated by sequences

at breakpoint D. Similarly, the distal breakpoint, a region

of predicted copy number 0 (yellow highlight, Figures

S10B and S10D), for breakpoints A–C is inconsistent with

the log2 ratios observed in families 1 and 2. Thus, the

most likely sequences mediating NAHR for families 1 and

2 are those of breakpoint D, corresponding to a pair of

directly oriented SDs with >99% identity and a length of

~75 kbp in the current H2 assembly.

In contrast to that in families 1 and 2, relative copy-

number loss proximal to 340 kbp and distal to 1.38 Mbp

in family 3 (orange highlight, Figure S10) was not consis-

tent with intrachromosomal NAHR involving any of the

breakpoints A–D but was consistent with the previous

microsatellite data suggesting that the family 3 deletion

might be mediated by interchromosomal NAHR between

the H1 and H2 haplotypes. This was paradoxical; it

would require sequence proximal to the unique deleted

sequence on the H1 haplotype to directly orient with

paralogous sequence distal to the unique deleted sequence

on the H2 haplotype. However, such sequences are

not observed in the current H2 assembly

(Figure S7 and Table S4).26 This suggested several possible

hypotheses. If the H1/H2 crossover and the deletion

were separate events, then the family 3 deletion could

have occurred on an H2 haplotype with altered copy

number within SDs or might not have been the result

of NAHR. Alternatively, interchromosomal crossover

between the H1 and H2 haplotypes might have occurred

as a result of sequences not currently represented in the

H2 assembly.

We performed array CGH between hybrid cell lines con-

taining the H2 chromosome from the mother in family 3

and the mother in family 2 and observed no copy-number

differences across the region (Figure S12). This suggested

that the unusual log2 ratio observed for the deletion in

family 3 was not the result of structural variation or poly-

morphism on the H2 haplotype.

Closing the Sequence Gaps in the H2 Assembly

We explored the possibility that crossover between H1 and

H2 haplotypes is mediated by previously unrepresented

sequence in the current haplotype assembly. There are

two gaps within the current H2 assembly in GRCh37

(gap 1 and gap 2; Figure 3), both of which lie distal

to the unique deleted sequence (Figure 3; Figure S7).

Previously reported marker data suggested that gap 2

(spanned by RP11-84A7) does not contain sequence that

can mediate 17q21.31 deletions by H1/H2 NAHR.25 In

contrast, a draft sequence of RP11-374N3 (AC048388) con-

tained sequence paralogous to SDs proximal to the unique

deleted sequence on both the H1 and H2 haplotypes,

in agreement with our hypothesis that H1/H2 NAHR

might occur. This was additionally supported by the pres-

ence and orientation of microsatellites DG17S133 and

DG17S435 in RP11-374N3 (Figure S9).25

We noted that, to close gap 2 (~130 kbp), Stefansson

et al.25 had placed RP11-84A7, which was not used in

the H2 sequence assembly,26 in a BAC assembly to

connect the distal end of the H2 haplotype to the refer-

ence assembly. To reconfirm placement of RP11-84A7

(AC243906) on the H2 haplotype, we first end-sequenced

the clone and noted that the T7 end maps to the distal

portion of either the H1 or H2 assembly from Zody

et al.26 and that the SP6 end maps to AC019319 in build

36. In order to distinguish placement of RP11-84A7 on

the H2 haplotype versus the H1 haplotype, we compared

microsatellites on RP11-84A7 with those on RP11-619A10

(AC217775), the last BAC in the H2 assembly, by using

RP11-113E17, a clone assigned by Stefansson et al.25 to

the H1 haplotype, as a negative control (Figure S13 and

Table S5). Marker genotyping confirmed the predicted overlap

between RP11-619A10 and RP11-84A7 and also demonstrated

RP11-84A7 and RP11-113E17 to be on opposite

haplotypes. Finally, the size of gap 2 was estimated as the

average size of a BAC from RP11 minus its overlap, based


on end-sequence placement, with sequence on either side

of the gap (130 kbp = 180 kbp – 25 kbp – 25 kbp).
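
The gap 2 estimate is a subtraction of the two flanking overlaps from an assumed clone length. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the 180 kbp average insert and 25 kbp overlaps are simply the values quoted in the text, not parameters of any published tool.

```python
def estimate_gap_size_kbp(avg_clone_kbp, left_overlap_kbp, right_overlap_kbp):
    """Estimate the size of an assembly gap spanned by a single clone.

    The clone is assumed to have the library's average insert size. The
    portions that overlap finished sequence on either side of the gap
    (inferred from end-sequence placement) are subtracted.
    """
    gap = avg_clone_kbp - left_overlap_kbp - right_overlap_kbp
    if gap < 0:
        raise ValueError("overlaps exceed the assumed clone size")
    return gap


# Values quoted in the text for RP11-84A7 spanning gap 2.
gap2_kbp = estimate_gap_size_kbp(180, 25, 25)
print(gap2_kbp)  # 130
```

Because the true insert size of an individual BAC can differ from the library average, the result is best read as an approximation.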

RP11-374N3 (AC048388) was previously determined to

span gap 1 (~70 kbp) in the H2 assembly but could not

be assembled by shotgun sequencing alone.26 We hypoth-

esized that this was due to the presence of two arms of

oppositely oriented, highly identical sequence separated

by a spacer sequence unique within the clone (Figure S2).

Importantly, the hypothesized structure suggested that

gap 1 contains sequence paralogous to SDs on the H1 and

H2 haplotypes and that this sequence might mediate

NAHR. If sequence in gap 1 largely corresponds to one of

two highly identical arms of sequence in RP11-374N3

(Figure S2), then the other duplicated arm of sequence,

entirely contained within the neighboring finished clone

AC217768, provides a good approximation of the sequence

in gap 1. On the basis of this hypothesized structure, we

estimated that gap 1 contains 40 kbp and 70 kbp of

sequence with ~99% identity to the H1 and H2 haplotypes

proximal to the unique deleted sequence, respectively.

We sequenced RP11-84A7 and additional clone-

based resources to aid in the assembly of RP11-374N3

(Figure 3). A draft assembly of RP11-84A7 (spanning gap

2; AC243906) did not contain sequence that could mediate

17q21.31 deletions. Because RP11-374N3 (spanning gap 1)

previously could not be assembled by shotgun sequencing

alone,26 we identified three additional smaller clones of

a fosmid clone library (ABC14) from an H1/H2 individual

to effectively provide subassembly and resolve near-perfect

local duplications of the larger BAC (Material and Methods

and additional references19,46,47). As predicted, draft

sequences from these clones (AC244161, AC244163, and

AC244164) identified the presence of an additional ~70

kbp of SD in direct orientation (~99% estimated identity)

between gap 1 and the H2 haplotype proximal to unique

deleted sequence and an additional ~40 kbp of SD in direct

orientation (>99% estimated identity) between gap 1 and

the H1 haplotype. This confirmed our hypothesized struc-

ture of RP11-374N3 and therefore that previously unchar-

acterized sequence in the H2 assembly could mediate

NAHR between the H1 and H2 haplotypes in family 3.

Additionally, it suggested that the length of breakpoint

D, which probably mediated deletions in the remaining

five probands, is nearly twice as large (~145 kbp versus

75 kbp) as what is annotated in the human genome

reference.

Identification of Breakpoint-Informative Paralogous

Sequence Variants

To achieve sequence-level resolution, we identified SUNs,

PSVs unique to specific loci in the genome (Figure 1C), as

well as other breakpoint-informative PSVs within the SDs

mediating the observed 17q21.31 deletions.24 We used

two different techniques to identify breakpoint-informa-

tive PSVs (Figure S6). For sequences present in the current

H2 assembly, we identified PSVs by using WGAC as

described previously6 to generate alignments of paralogous

sequence. To create SUNs, we then filtered PSVs by deter-

mining which PSVs could generate unique 36 bp reads

with respect to the human and mouse genomes (Material

and Methods). For sequences mapping to gaps in the

current H2 assembly, PSVs were identified from the align-

ment of the fosmid draft sequences mapping to gap 1

with the expected regions of paralogous sequence on the

H1 and H2 haplotypes (Material and Methods). This

technique could be useful with other regions that have

alternate structural haplotypes and where a haplotype-

specific sequence assembly might not exist, yet where

the haplotype of a given clone is known. Subsequent

filtering of these PSVs revealed relatively few SUNs in gap

1 (Table 1). This was due to the near identity of sequence

within gap 1 to sequence immediately proximal on

AC217768 in the H2 assembly (Figure S2). This sequence,

however, would be lost in the event of H1/H2 or H2/H2

NAHR. Therefore, gap 1 PSVs present elsewhere only

within AC217768 would still be breakpoint informative

[Figure 3 graphic: H2 haplotype contig with clone labels AC217778, AC217769, AC138688, AC127032, AC217772, AC217779, BX544879, AC217770, AC225613, AC217768, AC139677, and AC217775, extending to AC019319 (build 36; chr17:40799295); sequenced RP11 BACs (RP11-374N3 and RP11-84A7) and sequenced ABC14 fosmids are marked; scale bar, 100 kb.]

Figure 3. Completion of the H2 Contig with Clone-Based Resources

Two gaps exist in the H2 contig (dotted vertical lines). The distal gap (gap 2, ~130 kbp) is spanned by the previously placed BAC RP11-84A7.25 So that the proximal gap (gap 1, ~70 kbp) can be closed, assembly of RP11-374N3 will be completed with the assistance of additional clones from the fosmid library of an H1/H2 individual (ABC14, NA12156).


in the event of H2/H2 NAHR and would thus effectively act

as SUNs (Material and Methods). Similarly, gap 1 PSVs

present elsewhere in the genome but exclusively on the

H2 haplotype within or proximal to AC217768 would

effectively act as SUNs in the event of H1/H2 NAHR.

After quality control (Material and Methods), we identi-

fied 4,680 36-mers corresponding to 187 distinct PSVs that

can be used to distinguish deletions due to H2 interchro-

matidal NAHR and 3,912 36-mers corresponding to 142

distinct PSVs that can be used to distinguish H1/H2 inter-

chromosomal NAHR (Table S6).
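
The filtering idea described above, keeping only those k-mers that cover a PSV and occur nowhere else, can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes exact string matching against an in-memory background set rather than genome-wide read mapping, it uses k = 5 instead of 36 so that the example stays readable, and all function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def kmer_set(sequences, k):
    """Collect every k-mer, on both strands, found in the given sequences."""
    seen = set()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            seen.add(kmer)
            seen.add(reverse_complement(kmer))
    return seen


def kmers_covering(seq, pos, k):
    """Return all k-mers of seq that include the 0-based position pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[start:start + k] for start in range(first, last + 1)]


def singly_unique_kmers(locus_seq, psv_positions, background, k=36):
    """Map each PSV position to its covering k-mers absent from background."""
    unique = {}
    for pos in psv_positions:
        hits = [kmer for kmer in kmers_covering(locus_seq, pos, k)
                if kmer not in background]
        if hits:
            unique[pos] = hits
    return unique


# Toy example with k = 5: two paralogs that differ only at index 7 (G vs. A).
paralog_a = "ACGTACGGTCA"
paralog_b = "ACGTACGATCA"
background = kmer_set([paralog_b], k=5)  # stands in for "the rest of the genome"
suns = singly_unique_kmers(paralog_a, [2, 7], background, k=5)
print(suns)  # only position 7 yields k-mers that distinguish paralog_a
```

With k = 36, a single PSV can be covered by up to 36 distinct k-mers, which is why the number of k-mers reported is much larger than the number of PSVs.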

Resolution of CNV Breakpoints within Paralogous

Sequence

We leveraged the specificity of next-generation sequence

data to achieve sequence-level breakpoint resolution in

the three parent-child trios by mapping genome sequence

data to this set of SUN identifiers. We initially compared

sequence patterns between the proband and mother for

family 2 by generating whole-genome sequence from

both individuals. We generated ~26 Gbp of sequence (~9-

fold coverage) for the family 2 mother, who was an H2-

homozygote, by using the SOLIDv4 sequencing platform.

As expected, reads aligned to breakpoint-informative

PSVs across both the proximal and distal paralogs of the

breakpoint D region: an ~145 kbp region of near-perfect

sequence identity including previously uncharacterized

sequence mapping to the gap in the H2 assembly. This

finding is consistent with array CGH results from

somatic cell hybrids showing that the mother is diploid

across the 17q21.31 microdeletion region (Figure 4). In

stark contrast, when genome sequence (44 Gbp, ~15-fold

coverage) was generated from the proband in family 2 and

mapped to these variants, we observed no aligned reads to

PSVs on the proximal paralog of breakpoint D past the H2

position at 508,415 bp and no aligned reads to PSVs before

the H2 position at 1,209,274 bp on the distal paralog. This

localizes the crossover between the paralogs and refines

the deletion breakpoints from a 145 kbp region based on

array CGH to a ~22 kbp window (H2:508,415–529,961 on

the proximal paralog and Gap 1: 56,251 to H2:1,209,274

on the distal paralog; chr17_ctg5_hap1:567,056–588,595

on the proximal paralog and gap 1 to chr17_

ctg5_hap1:1,317,189 in the GRCh37 genomic sequence).

This breakpoint includes the 5′ UTR of KANSL1.
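
The logic of bounding the crossover, taking the last proximal-paralog PSV with read support and the first distal-paralog PSV with read support, can be sketched as follows. The sketch assumes that both paralogs have been placed in a shared alignment coordinate and that the PSVs assayed are specific to the deleted haplotype; the function name, the `min_reads` threshold, and the toy positions are hypothetical and are not taken from the study.

```python
def crossover_window(proximal_psvs, distal_psvs, min_reads=1):
    """Bound an NAHR crossover from read support at paralog-specific PSVs.

    Each argument is a list of (aligned_position, read_count) pairs for PSVs
    specific to one paralog, in a shared alignment coordinate. On a deletion
    chromosome the chimeric SD keeps proximal-paralog sequence up to the
    crossover and distal-paralog sequence after it.
    """
    supported_prox = [pos for pos, reads in proximal_psvs if reads >= min_reads]
    supported_dist = [pos for pos, reads in distal_psvs if reads >= min_reads]
    if not supported_prox or not supported_dist:
        return None  # one paralog has no support at all, so no closed window
    last_prox = max(supported_prox)
    first_dist = min(supported_dist)
    if last_prox >= first_dist:
        raise ValueError("PSV support is not collinear; inspect the hits manually")
    return last_prox, first_dist


# Toy read counts (illustrative only).
proximal = [(1000, 5), (4000, 3), (9000, 4), (15000, 0), (21000, 0), (30000, 0)]
distal = [(2000, 0), (6000, 0), (12000, 0), (26000, 6), (33000, 2)]
window = crossover_window(proximal, distal)
print(window)  # (9000, 26000)
```

A stray read at a PSV outside the true window would break the collinearity check, which is why the sketch raises an error rather than silently reporting an interval; a real analysis would need a more tolerant rule.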

We repeated this mapping strategy by focusing on the

remaining two probands. We generated ~42 Gbp of

whole-genome sequence (~14-fold coverage) for the

proband from family 1 (31928) and ~46 Gbp of sequence

(~15-fold coverage) for the proband from family 3 (31873)

by using the Illumina HiSeq 2000 platform. In family 1, we

narrowed the deletion breakpoints to a ~4 kbp window

(H2:554,425–558,503 and H2:1,233,725–1,237,776 on

the proximal and distal paralogs, respectively; chr17_ctg5_

hap1:613,066–617,144 and chr17_ctg5_hap1:1,341,640–

1,345,691 in the GRCh37 genomic sequence) that includes

the first coding exon of KANSL1. Although we observe a few

sequence read alignments to PSVs outside of these break-

point intervals, the hits are not collinear, and we attribute

these to either polymorphisms between the H1 and H2

haplotypes or spurious PCR-induced mutations that arose

during library prep. Finally, we observed no reads aligning

to PSVs from the proximal segment of breakpoint D in

family 3, but we did observe sequence alignments after

the gap 1 position at 45,302 bp on the distal paralog, which

aligns to the position at 248,866 bp on the H1 assembly. The

first PSV observed on the H1 assembly proximal to this is at

the H1 position at 224,601 bp. This places the breakpoint in

a ~24 kbp window (chr17:43,668,073–43,692,338 in the

GRCh37 genomic sequence) upstream of CRHR1 on the

H1 chromosome and completely within the gap 1 sequence

of the H2 chromosome. This pattern is consistent with our

previous hypothesis of H1/H2-mediated NAHR because

such a crossover occurs within the expected region of

directly oriented H1/H2 SDs and would remove the prox-

imal paralog of breakpoint D in its entirety.

Discussion

We employed a combination of technologies and analyses

that allow for breakpoint delineation within genomic

regions previously refractory to analysis. We note three

key components of our analysis. First, generation of

Table 1. Summary of Identified Breakpoint-Informative PSVs

Name | Region(s) | Description | k-mers | PSVs (SUNs)
H2 Proximal and Distal | H2:519,560–593,627 bp, H2:1,198,880–1,273,881 bp | breakpoint D, proximal and distal paralogs | 2,627 | 86 (86)
H1/H2 Inferred Proximal | H1:219,599–261,693 bp | inferred H1 paralog to gap 1 | 845 | 37 (37)
H1/H2 Inferred Distal | H2, gap 1 | PSVs inferred from alignment to H1 | 440 | 19 (1)
H2/H2 Inferred Proximal | H2:452,165–519,559 bp | inferred H1 paralog to gap 1 | 858 | 40 (40)
H2/H2 Inferred Distal | H2, gap 1 | PSVs inferred from alignment to H2 | 1,195 | 61 (2)
H2/H2 Informative | NA | H2 proximal, H2 distal, and H2/H2 inferred proximal and distal | 4,680 | 187
H1/H2 Informative | NA | H2 proximal, H2 distal, and H1/H2 inferred proximal and distal | 3,912 | 142


somatic cell hybrids isolating chromosome 17 homologs

greatly simplified microsatellite and array CGH analysis

by providing haplotype-specific genetic data. Marker geno-

types were phased and allowed inferences to be made on

the basis of markers within SDs. Removal of the confound-

ing effects of an alternate haplotype was of particular rele-

vance for 17q21.31 so that copy-number polymorphisms

of NSF on the H1 haplotype could be resolved.25 Although

it is impractical to routinely design somatic cell hybrids for

individuals, these reagents proved powerful in helping to

interpret and validate our findings in this study. Final vali-

dation of our results would benefit from future technology

that allows Mbp-scale sequencing of single molecules from

proband DNA.

Second, when examining copy-number losses within

SDs, we found that it was crucial to discern the degree of

loss as a function of duplication copy number. Analysis

of observed log2 ratios versus expected relative copy

Figure 4. Breakpoint-Informative PSVs Identify 17q21.31 Deletion Breakpoints within SDs

Read depth (vertical lines) at breakpoint-informative PSVs (dots) has been plotted over an alignment of the proximal (top plot) and distal (bottom) paralogs of breakpoint D in two probands (B and C) with 17q21.31 deletions and the mother from family 2 (A), who is homozygous for the H2 haplotype. For the proband of family 3 (D), the paralogous H1 region (D, top plot) is plotted in approximate alignment with the inferred region of directly oriented paralogy in gap 1. The distribution of breakpoint-informative PSVs is determined, in part, by the relative density of repeat sequences in finished sequence (black blocks) or is inferred to be present in gap 1 (gray blocks). As expected in unaffected H2 chromosomes (A), breakpoint-informative PSVs can be observed along the entire length of the proximal and distal paralogs of breakpoint D. In contrast, sequence data from a 17q21.31 deletion in family 2 (B) demonstrates no PSVs past the H2 position at 508,415 bp on the proximal paralog and no PSVs proximal to the H2 position at 1,209,274 bp on the distal paralog of breakpoint D. These define the deletion breakpoints (dotted highlight) and the resulting chimeric SD product (gray highlight) of NAHR. A similar deletion pattern is observed in family 1 (C), although with a different breakpoint (H2 position at 554,425 bp and H2 position at 1,237,776 bp on the proximal and distal paralogs, respectively), reflecting the recurrent nature of the deletion. Finally, in family 3, H1-specific PSVs are uninformative because of the paternally inherited H1 chromosome (D), but H2-specific sequences demonstrate no PSVs from the proximal paralog of breakpoint D, consistent with H1/H2 NAHR.


number bolstered initial evidence from marker genotyping

that breakpoints in family 3 deletions, for example, were

distinct from those of families 1 and 2. This underscores

the utility of somatic cell hybrids in helping to provide

a sensitive framework of copy-number loss for medically

relevant regions of the genome. That is, if a particular SD

is present in two copies and if it is of importance to discern

whether zero, one, or both copies have been deleted, then

with chromosome-specific array CGH, one would need to

distinguish between relative copy number 1, 0.5, and 0,

respectively. In contrast, for array CGH using genomic

DNA, this would require distinguishing between relative

copy numbers 1, 0.75, and 0.5, which is substantially

more difficult. Moreover, modeling of expected versus

observed copy-number losses allowed us to infer defi-

ciencies in the current H2 assembly.
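
The contrast between the two assay designs can be made concrete with a short calculation. The sketch below is an illustration under stated assumptions: the SD is present in two copies per chromosome, copies are lost from one homolog only, and in the genomic DNA case the other homolog is intact and carries the same number of copies. The function names are hypothetical.

```python
import math


def relative_copy_number(copies_per_chromosome, copies_deleted, homologs_assayed):
    """Relative copy number of an SD after deleting copies from one homolog.

    homologs_assayed = 1 models a somatic cell hybrid that carries only the
    deleted chromosome; homologs_assayed = 2 models diploid genomic DNA in
    which the second homolog is assumed intact.
    """
    if not 0 <= copies_deleted <= copies_per_chromosome:
        raise ValueError("cannot delete more copies than one chromosome carries")
    total_before = copies_per_chromosome * homologs_assayed
    total_after = total_before - copies_deleted
    return total_after / total_before


def log2_ratio(relative):
    """Idealized array CGH log2 ratio for a given relative copy number."""
    return float("-inf") if relative == 0 else math.log2(relative)


# An SD present in two copies per chromosome, with 0, 1, or 2 copies deleted.
for deleted in (0, 1, 2):
    hybrid = relative_copy_number(2, deleted, homologs_assayed=1)
    genomic = relative_copy_number(2, deleted, homologs_assayed=2)
    print(deleted, hybrid, genomic, log2_ratio(hybrid), log2_ratio(genomic))
# relative copy number: hybrid 1.0, 0.5, 0.0; genomic 1.0, 0.75, 0.5
```

On the log2 scale, losing one of two copies shifts the hybrid signal by a full unit, whereas the same loss in genomic DNA shifts it by less than half a unit, which is consistent with the statement that the genomic case is harder to resolve. Real arrays show compressed and noisy ratios, so these values are idealized.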

The final component of our analysis that permitted

sequence-level breakpoint resolution is the discovery of

phased, locus-specific paralogous sequence variation. For

our model, locus-specific PSVs (SUNs) were known either

by virtue of an accurate, haplotype-specific reference

assembly or, for gaps in this assembly, sequencing of

clone-based resources. It is perhaps not surprising that

several (2/3) of the breakpoints map to the few remaining

gaps in the duplicated regions given that these are the

most highly identical, the most difficult to resolve, and

the most likely to mediate NAHR.5,48 In some cases, we

were able to refine the breakpoints to a small interval of

4 kbp, whereas in other cases the breakpoints are still quite

large at 22 kbp. However, in large regions of perfect

sequence identity, it will be impossible to refine the inter-

vals any further unless discriminating SNPs specific to indi-

vidual families can be discovered.

Our analysis also yielded biological insights regarding

the 17q21.31 locus and its underlying rearrangements

(summarized in Figure 5). We identified additional SDs

[Figure 5 graphic: maps of the H1 and H2 haplotypes (GRCh37 chr17, 43.55–44.65 Mbp on H1) showing CRHR1, MAPT, KANSL1, LRRC37A, ARL17B, LRRC37A2, ARL17A, and LRRC37A4; segmental duplications; the H2-specific duplication; homologous sequence mediating H1/H2 NAHR; gap 1 and gap 2 with contig 1 and contig 2; and the breakpoints for families 1, 2, and 3.]

Figure 5. Summary of 17q21.31 Breakpoints on H1 and H2 Reference Assemblies

Sequence from breakpoint intervals was extracted from the H1 and H2 assemblies, aligned to the human reference sequence (GRCh37), and plotted on each haplotype. Coordinates represent the H1 haplotype on chromosome 17 (in Mbp), and hashed orange boxes represent segmental duplications. The H2-Specific Duplication, which contains sequence that mediates the NAHR event in family 1, is represented as solid orange blocks. Family 1 breakpoints were refined to a 4 kbp interval (green line) disrupting the first coding exon of KANSL1. The distal breakpoint of the microdeletion observed in family 2 (red line) falls in gap 1 (hashed black box) and has been refined to a 22 kbp interval within the 5′ UTR of KANSL1. In addition, another segment of gap 1 sequence (hashed gold box) is homologous to H1 sequence that mediates the H1/H2 NAHR event leading to the microdeletion in family 3, which has been narrowed to a 24 kbp interval of perfect sequence identity. Gap 1 sequence has been resolved with fosmid clones, resulting in contig 1, and gap 2 sequence has been resolved with BAC clones, resulting in contig 2.


critical to understanding the genetic basis for the unequal

crossing over that mapped to the gap of the H2 assembly.

First of all, we find that ~90% of 17q21.31 rearrangement

events (16/18 based on specific screening for the H1/H2

events) occurring as a result of interchromatidal NAHR

are driven by European-specific SDs on the H2 haplotype.

Second, all interchromatidal events were mediated by

a single pair of SDs that were ~145 kbp and had ~99% iden-

tity, which accounts for 84% of the directly oriented SDs

flanking the unique deleted sequence. In the two cases

where we refined these breakpoints by using genome

sequence data, the exact breakpoints differed but both

localized to the same 99% identity segment. In both cases

the rearrangements are predicted to disrupt KANSL1—for

example, the family 2 breakpoints occur precisely in the

first exon of this gene. It is noteworthy that the same dupli-

cations are highly stratified and have risen to high

frequency in individuals of European descent.25

We also show that 17q21.31 deletions can occur as

a result of interchromosomal NAHR between the H1 and

H2 haplotypes. Our limited survey of 17q21.31 break-

points indicates that interchromosomal NAHR is relatively

uncommon. One case was previously identified,43 and

we observed it independently twice in 18 probands, sug-

gesting that such events account for ~10% of 17q21.31

microdeletions. This is also compatible with previous

population genetic data and theoretical predictions that

crossovers between the H1 and H2 haplotypes are effec-

tively suppressed. Interchromatidal deletions are probably

more common than interchromosomal deletions for

several reasons. First, sperm typing has shown NAHR due

to interchromatidal deletions to be the predominant class

of NAHR.49 Second, the interchromosomal paralogous seg-

ments mediating unequal crossover are smaller (40 kbp

versus 145 kbp) and less numerous than those that can

mediate interchromatidal NAHR. Finally, most crossover

events between H1 and H2 in this region would be

between allelic sequences in inverted orientation, creating

the classic acentric and dicentric chromosomal products of

a paracentric inversion, and are therefore inviable.

17q21.31 represents one of the most studied human

genomic loci for which a complex alternate structural

haplotype has been generated. Additional loci have either

been implicated in pathogenic deletions or have been

shown to have structural haplotypes predisposing an

individual to such deletions.50,51 Unlike the 17q21.31

locus, none of these regions, to our knowledge, yet have

haplotype-specific sequence assemblies. Although this

presents a challenge, the methods we have developed

provide a clear path forward to fine-mapping of breakpoints

within segmental regions both in basic research and, ulti-

mately, in a clinical setting. We propose the following

strategy. In lieu of somatic cell hybrids, recently developed

methods involving next-generation sequencing of flow-

sorted chromosomes52 or pooled fosmids53 could be em-

ployed for the rapid generation of haplotype-specific

sequence data, recovery of sequence information within

the gaps, and discovery of large structural polymorphisms.

Phased, locus-specific paralogous sequence variation could

be generated through targeted sequencing of clone-based

resources that now exist for more than 30 human

genomes19,51,54 or through conventional46 or massively

parallel sequencing53 methods. This would allow the estab-

lishment of high-quality alternate reference haplotypes of

the human reference genome as is being pursued by

the Genome Reference Consortium (Online Resources).

These data could be used in the creation of a catalog of

SUN identifiers so that breakpoints in deletion probands

could be refined. Once such a catalog was established,

it would be relatively trivial to routinely delineate the breakpoints

of duplication and deletion probands with extraordinary

precision by mapping complete genome sequencing

to this catalog of sequence variants. This is important

clinically for distinguishing breakpoints that are superfi-

cially similar (by array CGH) but that have different func-

tional consequences with respect to breakpoints within

duplicated genes or portions of genes (e.g., CHRNA7,55

SIRPB1,19 or KANSL1 [present study]). It is possible that these

differences in breakpoints contribute to the variability of

expressivity for genomic disorders and, as such, that it will

be important to distinguish between them in the future.

Supplemental Data

The Supplemental Data include 13 figures and six tables and can

be found with this article online at http://www.cell.com/AJHG.

Acknowledgments

We thank B. Coe, S. Ng and J. Hehir-Kwa for thoughtful

discussion, T. Brown for assistance with manuscript preparation,

A. Mackenzie, C. Igartua, C. Fields, S. Casadei, L. Vives, members

of the Mayo Medical Laboratories, members of The Genome Insti-

tute at Washington University, and members of the Hubrecht

Institute for assistance with data generation, and B. de Vries for

clinical collection and evaluation of individuals with 17q microdeletions

and their parents. K.M.S. was supported by a Ruth L.

Kirschstein National Research Service Award (NRSA) Fellowship

(F32GM097807). This work was supported by National Institutes

of Health grants HG002385 and HG004120 to E.E.E. and the

Netherlands Organization for Health Research and Development

(ZonMW 916.86.016 to L.E.L.M.V. and 917.66.363 to J.A.V.).

E.E.E. is an investigator of the Howard Hughes Medical Institute.

E.E.E. is on the scientific advisory boards for Pacific Biosciences,

Inc. and SynapDx Corp.

Received: November 30, 2011

Revised: January 23, 2012

Accepted: February 16, 2012

Published online: March 29, 2012

Web Resources

The URLs for data presented herein are as follows:

1000 Genomes Project, http://www.1000genomes.org/

The EMBOSS software suite, http://emboss.sourceforge.net/


Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/

Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/

projects/genome/assembly/grc/

International HapMap Project, http://hapmap.ncbi.nlm.nih.gov/

JAligner Java implementation of the Smith-Waterman algorithm,

http://jaligner.sourceforge.net/

Marshfield Genetic Maps, http://research.marshfieldclinic.org/

genetics/

mrFAST, http://mrfast.sourceforge.net/

NCBI nucleotide database, http://www.ncbi.nlm.nih.gov/unists

NCBI BLAST and megaBLAST, http://blast.ncbi.nlm.nih.gov/

NCBI UniSTS database, http://www.ncbi.nlm.nih.gov/unists

Online Mendelian Inheritance in Man (OMIM), http://www.

omim.org/

RepeatMasker, http://www.repeatmasker.org/

Tandem Repeats Finder, http://tandem.bu.edu/trf/trf.html

UCSC Human Genome Browser (human reference genomes),

http://genome.ucsc.edu

Accession Numbers

The NCBI nucleotide accession numbers for the four clone

sequences reported in this paper are AC244161, AC244163,

AC244164, and AC243906.

The GEO accession numbers for the nine microarray experi-

ments in this paper are GSE34867.

References

1. Stankiewicz, P., and Lupski, J.R. (2010). Structural variation in

the human genome and its role in disease. Annu. Rev. Med.

61, 437–455.

2. Conrad, D.F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O.,

Zhang, Y., Aerts, J., Andrews, T.D., Barnes, C., Campbell, P.,

et al; Wellcome Trust Case Control Consortium. (2010).

Origins and functional impact of copy number variation in

the human genome. Nature 464, 704–712.

3. Vissers, L.E., de Vries, B.B., and Veltman, J.A. (2010). Genomic

microarrays in mental retardation: From copy number varia-

tion to gene, from research to diagnosis. J. Med. Genet. 47,

289–297.

4. Girirajan, S., and Eichler, E.E. (2010). Phenotypic variability

and genetic susceptibility to genomic disorders. Hum. Mol.

Genet. 19 (R2), R176–R187.

5. Lupski, J.R. (1998). Genomic disorders: structural features of

the genome can lead to DNA rearrangements and human

disease traits. Trends Genet. 14, 417–422.

6. Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J., and Eichler,

E.E. (2001). Segmental duplications: organization and impact

within the current human genome project assembly. Genome

Res. 11, 1005–1017.

7. Xu, B., Roos, J.L., Levy, S., van Rensburg, E.J., Gogos, J.A., and

Karayiorgou, M. (2008). Strong association of de novo copy

number mutations with sporadic schizophrenia. Nat. Genet.

40, 880–885.

8. Sebat, J., Lakshmi, B., Malhotra, D., Troge, J., Lese-Martin, C.,

Walsh, T., Yamrom, B., Yoon, S., Krasnitz, A., Kendall, J., et al.

(2007). Strong association of de novo copy number mutations

with autism. Science 316, 445–449.

9. Sharp, A.J., Mefford, H.C., Li, K., Baker, C., Skinner, C., Steven-

son, R.E., Schroer, R.J., Novara, F., De Gregori, M., Ciccone, R.,

et al. (2008). A recurrent 15q13.3 microdeletion syndrome

associated with mental retardation and seizures. Nat. Genet.

40, 322–328.

10. de Vries, B.B., Pfundt, R., Leisink, M., Koolen, D.A., Vissers,

L.E., Janssen, I.M., Reijmersdal, S., Nillesen, W.M., Huys,

E.H., Leeuw, N., et al. (2005). Diagnostic genome profiling in

mental retardation. Am. J. Hum. Genet. 77, 606–616.

11. Mefford, H.C., Sharp, A.J., Baker, C., Itsara, A., Jiang, Z.,

Buysse, K., Huang, S., Maloney, V.K., Crolla, J.A., Baralle, D.,

et al. (2008). Recurrent rearrangements of chromosome

1q21.1 and variable pediatric phenotypes. N. Engl. J. Med.

359, 1685–1699.

12. Greenway, S.C., Pereira, A.C., Lin, J.C., DePalma, S.R., Israel,

S.J., Mesquita, S.M., Ergul, E., Conta, J.H., Korn, J.M., McCar-

roll, S.A., et al. (2009). De novo copy number variants identify

new genes and loci in isolated sporadic tetralogy of Fallot. Nat.

Genet. 41, 931–935.

13. Bochukova, E.G., Huang, N., Keogh, J., Henning, E., Purmann,

C., Blaszczyk, K., Saeed, S., Hamilton-Shield, J., Clayton-

Smith, J., O’Rahilly, S., et al. (2010). Large, rare chromosomal

deletions associated with severe early-onset obesity. Nature

463, 666–670.

14. Mefford, H.C., Clauin, S., Sharp, A.J., Moller, R.S., Ullmann,

R., Kapur, R., Pinkel, D., Cooper, G.M., Ventura, M., Ropers,

H.H., et al. (2007). Recurrent reciprocal genomic rearrange-

ments of 17q12 are associated with renal disease, diabetes,

and epilepsy. Am. J. Hum. Genet. 81, 1057–1069.

15. Lupski, J.R., de Oca-Luna, R.M., Slaugenhaupt, S., Pentao, L.,

Guzzetta, V., Trask, B.J., Saucedo-Cardenas, O., Barker, D.F.,

Killian, J.M., Garcia, C.A., et al. (1991). DNA duplication asso-

ciated with Charcot-Marie-Tooth disease type 1A. Cell 66,

219–232.

16. Chen, K.S., Manian, P., Koeuth, T., Potocki, L., Zhao, Q., Chi-

nault, A.C., Lee, C.C., and Lupski, J.R. (1997). Homologous

recombination of a flanking repeat gene cluster is a mecha-

nism for a common contiguous gene deletion syndrome.

Nat. Genet. 17, 154–163.

17. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody,

M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh,

W., et al; International Human Genome Sequencing Consor-

tium. (2001). Initial sequencing and analysis of the human

genome. Nature 409, 860–921.

18. Lee, J.A., Carvalho, C.M., and Lupski, J.R. (2007). A DNA

replication mechanism for generating nonrecurrent rear-

rangements associated with genomic disorders. Cell 131,

1235–1247.

19. Kidd, J.M., Cooper, G.M., Donahue, W.F., Hayden, H.S.,

Sampas, N., Graves, T., Hansen, N., Teague, B., Alkan, C.,

Antonacci, F., et al. (2008). Mapping and sequencing of structural

variation from eight human genomes. Nature 453, 56–64.

20. Tuzun, E., Sharp, A.J., Bailey, J.A., Kaul, R., Morrison, V.A.,

Pertz, L.M., Haugen, E., Hayden, H., Albertson, D., Pinkel,

D., et al. (2005). Fine-scale structural variation of the human

genome. Nat. Genet. 37, 727–732.

21. Ye, K., Schulz, M.H., Long, Q., Apweiler, R., and Ning, Z.

(2009). Pindel: a pattern growth approach to detect break

points of large deletions and medium sized insertions from

paired-end short reads. Bioinformatics 25, 2865–2871.

22. Korbel, J.O., Urban, A.E., Affourtit, J.P., Godwin, B., Grubert,

F., Simons, J.F., Kim, P.M., Palejev, D., Carriero, N.J., Du, L.,

et al. (2007). Paired-end mapping reveals extensive structural

variation in the human genome. Science 318, 420–426.

The American Journal of Human Genetics 90, 599–613, April 6, 2012 611

23. Pentao, L., Wise, C.A., Chinault, A.C., Patel, P.I., and Lupski,

J.R. (1992). Charcot-Marie-Tooth type 1A duplication appears

to arise from recombination at repeat sequences flanking the

1.5 Mb monomer unit. Nat. Genet. 2, 292–300.

24. Sudmant, P.H., Kitzman, J.O., Antonacci, F., Alkan, C., Malig,

M., Tsalenko, A., Sampas, N., Bruhn, L., Shendure, J., and

Eichler, E.E.; 1000 Genomes Project. (2010). Diversity of

human copy number variation and multicopy genes. Science

330, 641–646.

25. Stefansson, H., Helgason, A., Thorleifsson, G., Steinthorsdot-

tir, V., Masson, G., Barnard, J., Baker, A., Jonasdottir, A.,

Ingason, A., Gudnadottir, V.G., et al. (2005). A common

inversion under selection in Europeans. Nat. Genet. 37,

129–137.

26. Zody, M.C., Jiang, Z., Fung, H.C., Antonacci, F., Hillier, L.W.,

Cardone, M.F., Graves, T.A., Kidd, J.M., Cheng, Z., Abouelleil,

A., et al. (2008). Evolutionary toggling of the MAPT 17q21.31

inversion region. Nat. Genet. 40, 1076–1083.

27. Baker, M., Litvan, I., Houlden, H., Adamson, J., Dickson, D.,

Perez-Tur, J., Hardy, J., Lynch, T., Bigio, E., and Hutton, M.

(1999). Association of an extended haplotype in the tau

gene with progressive supranuclear palsy. Hum. Mol. Genet.

8, 711–715.

28. Broman, K.W., Murray, J.C., Sheffield, V.C., White, R.L., and

Weber, J.L. (1998). Comprehensive human genetic maps:

individual and sex-specific variation in recombination. Am.

J. Hum. Genet. 63, 861–869.

29. Sharp, A.J., Itsara, A., Cheng, Z., Alkan, C., Schwartz, S., and

Eichler, E.E. (2007). Optimal design of oligonucleotide micro-

arrays for measurement of DNA copy-number. Hum. Mol.

Genet. 16, 2770–2779.

30. Benovoy, D., Kwan, T., and Majewski, J. (2008). Effect of

polymorphisms within probe-target sequences on olignonu-

cleotide microarray experiments. Nucleic Acids Res. 36,

4417–4423.

31. Lee, I., Dombkowski, A.A., and Athey, B.D. (2004). Guidelines

for incorporating non-perfectly matched oligonucleotides

into target-specific hybridization probes for a DNAmicroarray.

Nucleic Acids Res. 32, 681–690.

32. Frazer, K.A., Pachter, L., Poliakov, A., Rubin, E.M., and

Dubchak, I. (2004). VISTA: Computational tools for compara-

tive genomics. Nucleic Acids Res. 32, W273–W279.

33. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman,

D.J. (1990). Basic local alignment search tool. J. Mol. Biol.

215, 403–410.

34. Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000). A

greedy algorithm for aligning DNA sequences. J. Comput.

Biol. 7, 203–214.

35. Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: The

European Molecular Biology Open Software Suite. Trends

Genet. 16, 276–277.

36. Igartua, C., Turner, E.H., Ng, S.B., Hodges, E., Hannon, G.J.,

Bhattacharjee, A., Rieder, M.J., Nickerson, D.A., and Shendure,

J. (2010). Targeted enrichment of specific regions in the

human genome by array hybridization. Curr. Prot. Hum.

Genet., Chapter 18, Unit 18.3.

37. Benson, G. (1999). Tandem repeats finder: a program to

analyze DNA sequences. Nucleic Acids Res. 27, 573–580.

38. Alkan, C., Kidd, J.M., Marques-Bonet, T., Aksay, G., Antonacci,

F., Hormozdiari, F., Kitzman, J.O., Baker, C., Malig, M., Mutlu,

O., et al. (2009). Personalized copy number and segmental

duplication maps using next-generation sequencing. Nat.

Genet. 41, 1061–1067.

39. Durbin, R.M., Abecasis, G.R., Altshuler, D.L., Auton, A.,

Brooks, L.D., Gibbs, R.A., Hurles, M.E., and McVean, G.A.;

1000 Genomes Project Consortium. (2010). A map of human

genome variation from population-scale sequencing. Nature

467, 1061–1073.

40. Conrad, C., Andreadis, A., Trojanowski, J.Q., Dickson, D.W.,

Kang, D., Chen, X., Wiederholt, W., Hansen, L., Masliah, E.,

Thal, L.J., et al. (1997). Genetic evidence for the involvement

of tau in progressive supranuclear palsy. Ann. Neurol. 41,

277–281.

41. Koolen, D.A., Vissers, L.E., Pfundt, R., de Leeuw, N., Knight,

S.J., Regan, R., Kooy, R.F., Reyniers, E., Romano, C., Fichera,

M., et al. (2006). A new chromosome 17q21.31 microdeletion

syndrome associated with a common inversion polymor-

phism. Nat. Genet. 38, 999–1001.

42. Sharp, A.J., Hansen, S., Selzer, R.R., Cheng, Z., Regan, R.,

Hurst, J.A., Stewart, H., Price, S.M., Blair, E., Hennekam,

R.C., et al. (2006). Discovery of previously unidentified

genomic disorders from the duplication architecture of the

human genome. Nat. Genet. 38, 1038–1042.

43. Shaw-Smith, C., Pittman, A.M., Willatt, L., Martin, H., Rick-

man, L., Gribble, S., Curley, R., Cumming, S., Dunn, C., Kalait-

zopoulos, D., et al. (2006). Microdeletion encompassing

MAPT at chromosome 17q21.3 is associated with develop-

mental delay and learning disability. Nat. Genet. 38, 1032–

1037.

44. Trask, B.J. (2002). Human cytogenetics: 46 chromosomes, 46

years and counting. Nat. Rev. Genet. 3, 769–778.

45. Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve,

L.L., Gibbs, R.A., Belmont, J.W., Boudreau, A., Hardenbol, P.,

Leal, S.M., et al; International HapMap Consortium. (2007).

A second generation human haplotype map of over

3.1 million SNPs. Nature 449, 851–861.

46. Kidd, J.M., Cheng, Z., Graves, T., Fulton, B., Wilson, R.K., and

Eichler, E.E. (2008). Haplotype sorting using human fosmid

clone end-sequence pairs. Genome Res. 18, 2016–2023.

47. Kidd, J.M., Sampas, N., Antonacci, F., Graves, T., Fulton, R.,

Hayden, H.S., Alkan, C., Malig, M., Ventura, M., Giannuzzi,

G., et al. (2010). Characterization of missing human genome

sequences and copy-number polymorphic insertions. Nat.

Methods 7, 365–371.

48. Cooper, G.M., Nickerson, D.A., and Eichler, E.E. (2007). Muta-

tional and selective effects on copy-number variants in the

human genome. Nat. Genet. 39(7, Suppl), S22–S29.

49. Turner, D.J., Miretti, M., Rajan, D., Fiegler, H., Carter, N.P.,

Blayney, M.L., Beck, S., and Hurles, M.E. (2008). Germline

rates of de novo meiotic deletions and duplications causing

several genomic disorders. Nat. Genet. 40, 90–95.

50. Sharp, A.J., Cheng, Z., and Eichler, E.E. (2006). Structural vari-

ation of the human genome. Annu. Rev. Genomics Hum.

Genet. 7, 407–442.

51. Antonacci, F., Kidd, J.M., Marques-Bonet, T., Teague, B.,

Ventura, M., Girirajan, S., Alkan, C., Campbell, C.D., Vives,

L., Malig, M., et al. (2010). A large and complex structural

polymorphism at 16p12.1 underlies microdeletion disease

risk. Nat. Genet. 42, 745–750.

52. Fan, H.C., Wang, J., Potanina, A., and Quake, S.R. (2011).

Whole-genome molecular haplotyping of single cells. Nat.

Biotechnol. 29, 51–57.

612 The American Journal of Human Genetics 90, 599–613, April 6, 2012

53. Kitzman, J.O.,Mackenzie,A.P.,Adey,A.,Hiatt, J.B., Patwardhan,

R.P., Sudmant, P.H.,Ng, S.B., Alkan,C.,Qiu, R., Eichler, E.E., and

Shendure, J. (2011). Haplotype-resolved genome sequencing

of a Gujarati Indian individual. Nat. Biotechnol. 29, 59–63.

54. Kidd, J.M., Graves, T., Newman, T.L., Fulton, R., Hayden, H.S.,

Malig, M., Kallicki, J., Kaul, R., Wilson, R.K., and Eichler, E.E.

(2010). A human genome structural variation sequencing

resource reveals insights into mutational mechanisms. Cell

143, 837–847.

55. Shinawi,M., Schaaf, C.P., Bhatt, S.S., Xia, Z., Patel, A., Cheung,

S.W., Lanpher, B., Nagl, S., Herding, H.S., Nevinny-Stickel, C.,

et al. (2009). A small recurrent deletion within 15q13.3 is

associated with a range of neurodevelopmental phenotypes.

Nat. Genet. 41, 1269–1271.

The American Journal of Human Genetics 90, 599–613, April 6, 2012 613

ARTICLE

Primate Genome Gain and Loss: A Bone Dysplasia, Muscular Dystrophy, and Bone Cancer Syndrome Resulting from Mutated Retroviral-Derived MTAP Transcripts

Olga Camacho-Vanegas,1 Sandra Catalina Camacho,1 Jacob Till,1 Irene Miranda-Lorenzo,1

Esteban Terzo,1 Maria Celeste Ramirez,1 Vern Schramm,2 Grace Cordovano,2 Giles Watts,3 Sarju Mehta,3

Virginia Kimonis,3 Benjamin Hoch,4 Keith D. Philibert,5 Carsten A. Raabe,6 David F. Bishop,1

Marc J. Glucksman,5 and John A. Martignetti1,7,8,*

Diaphyseal medullary stenosis with malignant fibrous histiocytoma (DMS-MFH) is an autosomal-dominant syndrome characterized by

bone dysplasia, myopathy, and bone cancer. We previously mapped the DMS-MFH tumor-suppressing-gene locus to chromosomal

region 9p21–22 but failed to identify mutations in known genes in this region. We now demonstrate that DMS-MFH results from mutations

in the most proximal of three previously uncharacterized terminal exons of the gene encoding methylthioadenosine phosphorylase,

MTAP. Intriguingly, two of these MTAP exons arose from early and independent retroviral-integration events in primate genomes

at least 40 million years ago, and since then, their genomic integration has gained a functional role. MTAP is a ubiquitously expressed

homotrimeric-subunit enzyme critical to polyamine metabolism and adenine and methionine salvage pathways and was believed to be

encoded as a single transcript from the eight previously described exons. Six distinct retroviral-sequence-containing MTAP isoforms,

each of which can physically interact with archetype MTAP, have been identified. The disease-causing mutations occur within one of

these retroviral-derived exons and result in exon skipping and dysregulated alternative splicing of all MTAP isoforms. Our results identify

a gene involved in the development of bone sarcoma, provide evidence of the primate-specific evolution of certain parts of an existing

gene, and demonstrate that mutations in parts of this gene can result in human disease despite its relatively recent origin.

Introduction

Diaphyseal medullary stenosis with malignant fibrous

histiocytoma (DMS-MFH [MIM 112250]) is a rare, auto-

somal-dominant bone dysplasia and cancer syndrome of

unknown etiology.1–3 The disorder has a unique bone-

dysplasia phenotype characterized by cortical growth

abnormalities, including diffuse diaphyseal medullary

stenosis with overlying endosteal cortical thickening,

metaphyseal striations, and scattered infarctions within

the bone marrow. Affected individuals endure pathologic

fractures that subsequently heal poorly, progressive

wasting, bowing of the lower extremities, painful debilita-

tion, and the development of presenile cataracts. We

recently expanded the known clinical features of the

syndrome by characterizing two new unrelated families

affected by a progressive form of muscular disease consis-

tent with facioscapulohumeral muscular dystrophy

(FSHD [MIM 158900]) (see below). Among DMS-MFH-

affected individuals, approximately 35% develop a form

of bone sarcoma consistent with the diagnosis of malig-

nant fibrous histiocytoma (MFH).1–4

Using a positional-cloning approach, we originally local-

ized the disease-associated allele locus to chromosomal

region 9p21–22 and established a 3.5 cM critical locus

between markers D9S1778 and D9S171.4 Given the cancer

component of the syndrome, the 9p21–22 region is of

particular interest in that it is one of the most frequently

deleted and/or translocated chromosomal regions in

human cancer.5 A diverse group of human cancers

demonstrate loss of this region and include gliomas,6,7

melanomas,8 non-small-cell lung cancers,9 acute leuke-

mias,10,11 and, of direct significance to this study, osteosar-

comas.12,13 In an attempt to further narrow the region as

well as establish a link between hereditary and sporadic

tumor forms, we performed loss of heterozygosity (LOH)

analysis of sporadic MFH samples. This analysis supported

a shared genetic etiology between hereditary and sporadic

MFH cases and mapped the smallest region of overlap to

the 2.9 Mb region between markers D9S736 and

D9S171.14

A number of DMS-MFH candidate genes were originally

screened by DNA sequencing and were excluded because

they lacked mutations. These genes included the cyclin-

dependent kinase inhibitor 2A (CDKN2A [p16] [MIM

600160]) and its alternatively spliced product p14-ARF,

CDKN2B (p15), members of the interferon (IFN) super-

family, and the methylthioadenosine phosphorylase

gene (MTAP [MIM 156540]).4 MTAP has been thought

to consist of eight exons and seven introns15 and encode

1Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, NY 10029, USA; 2Department of Biochemistry, Albert Einstein

College of Medicine, Bronx, NY 10461, USA; 3University of California Irvine, Irvine, CA 92868, USA; 4Department of Pathology, Mount Sinai School of

Medicine, New York, NY 10029, USA; 5Midwest Proteome Center and Department of Biochemistry and Molecular Biology, Rosalind Franklin University

of Medicine and Science, Chicago Medical School, Chicago, IL, 60064, USA; 6Institute of Experimental Pathology, University of Muenster, 48149 Muenster,

Germany; 7Department of Pediatrics, Mount Sinai School of Medicine, New York, NY 10029, USA; 8Department of Oncological Sciences, Mount Sinai

School of Medicine, New York, NY 10029, USA

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.024. ©2012 by The American Society of Human Genetics. All rights reserved.

614 The American Journal of Human Genetics 90, 614–627, April 6, 2012

a ubiquitously expressed enzyme that plays a crucial role

in the salvage pathway for adenine and methionine in

all tissues.16 In the salvage pathways, methylthioadeno-

sine (MTA), a by-product of the polyamine pathway, is

recovered through its phosphorolysis into adenine and

methylthioribose-1-phosphate by MTAP.17 Through a

series of reactions, methylthioribose-1-phosphate is then

converted into methionine.17,18 It is suggested that loss

of MTAP activity plays a role in human cancer because its

loss has been reported in a number of cancers, including

osteosarcoma,12,13 leukemia,19 non-small-cell lung

cancer,20 malignant melanoma,21 biliary-tract cancer,22

breast cancer,23 pancreatic cancer,24 and gastrointestinal

stromal tumors.25 Reintroduction of MTAP expression

into the MCF7 breast adenocarcinoma cell line, which

lacks endogenous MTAP gene expression and enzymatic

activity, inhibits the cells’ ability to grow both in vitro

and in vivo;23 the fact that MTAP inhibits cell growth is

consistent with its presumed role as a tumor suppressor.

We have now identified and characterized the genetic

defect underlying DMS-MFH. All affected members of

five unrelated DMS-MFH-affected families possess synony-

mous mutations in the most proximal of three terminal

MTAP exons identified and characterized in these studies.

Interestingly, DNA-sequence analysis revealed that at

least two of the exons are remnants of retroviral insertions

into the primate genome. Both disease-causing mutations

in exon 9, the most proximal of these exons, result in

exon skipping and subsequent loss of this exon in alterna-

tively spliced, biologically active isoforms. Biochemical

studies provide evidence that isoforms containing exon 7

have MTAP activity. Altogether, these findings identify

a gene associated with both hereditary bone dysplasia

and osteosarcoma and also highlight the importance of

evolutionarily co-opted gene parts for both health and

disease.

Material and Methods

Linkage and Haplotype Mapping

After participants gave informed consent for these studies, which

were approved by the Human Research Protection Program at

the Mount Sinai School of Medicine, blood samples were obtained

from affected and unaffected family members. Genomic DNA was

extracted with the Puregene kit according to the manufacturer’s

(Minneapolis, MN) protocol, and individuals were genotyped

with a panel of markers spanning the length of chromosome 9;

included were markers from the Single Chromosome Scan Human

Screening Set (Research Genetics, Carlsbad, CA) and a number of

polymorphic markers that were custom generated. Family 1

from the original study was not included in this reanalysis. Allelic

determination was performed with the ABI 3130xl Genetic

Analyzer and GeneMapper 4.0 software (Applied Biosystems,

Foster City, CA). We handled the generated data by using Mega2 (ref. 26),

and linkage analysis was computed by SIMWALK2 v.2.83 (ref. 27)

with the following parameters: all families are genetically homog-

enous, inheritance pattern is autosomal dominant and has 80%

penetrance, there are no phenocopies, and the disease allele

frequency is 0.0001. Marker positions were obtained from the

Marshfield database and the UCSC Genome Browser March 2006

assembly. We further refined areas of positive location scores by

using custom-generated microsatellite markers within the defined

region. In brief, we designed the microsatellite markers by identi-

fying simple tandem repeats (STRs) in DNA sequences of clone

fragments from the Human Genome Assembly (UCSC Genome

Browser) by using the Tandem Repeats Finder Program.28 Fluores-

cently labeled primers were then designed to amplify these repeat

regions, and allele separation and analyses were performed as

previously described29 (Table S1, available online).
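The marker-design step above starts from locating simple tandem repeats in genomic sequence. As an illustration only, the following minimal Python sketch scans a sequence for perfect repeats with a regular expression. It is not the Tandem Repeats Finder algorithm, which also models imperfect repeats; the function name `find_strs`, its thresholds, and the example sequence are hypothetical.

```python
import re

def find_strs(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    Simplified stand-in for a dedicated repeat finder: only exact,
    uninterrupted repeats of 2-6 bp units are reported (assumed thresholds).
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # A unit of unit_len bases followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves built from a shorter motif
            # (for example "CACA" or "AA"), so each repeat is reported once.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)

# Hypothetical fragment containing a (CA)8 dinucleotide repeat.
example = "GGT" + "CA" * 8 + "TTG"
print(find_strs(example))
```

Primers flanking each reported repeat would then be designed separately; that step is not sketched here.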

DNA Sequence Analysis

We used PCR to amplify all MTAP exons by using Amplitaq-Gold

(Applied Biosystems, Foster City, CA), and we purified them by

using the QIAquick Spin PCR Purification Kit (QIAGEN, Valencia,

CA); the exons were amplified and purified according to the manufacturers’

protocol, and they were directly sequenced on an ABI

Prism 3700 automated DNA Analyzer (Applied Biosystems, Foster

City, CA). Data were analyzed with the program Sequencher v3.0

(Gene Codes Corporation, Ann Arbor, MI). Sets of intronic primers

that we used to amplify the coding region and intron and exon

boundaries are listed in Table S1. The PCR cycling conditions

were the following: 94°C (10 min) for 1 cycle, 94°C (30 s), 55°C (30 s), and 72°C (60 s) for 35 cycles each, and a final extension

of 72°C (10 min).
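The cycling conditions above can be written down as a small program description, which makes the total programmed hold time easy to check. The sketch below is illustrative; the variable names are hypothetical, and ramp times between temperatures are ignored because they are instrument dependent.

```python
# Each step is (temperature in degrees C, hold time in seconds).
initial_denaturation = [(94, 600)]
cycle_steps = [(94, 30), (55, 30), (72, 60)]
final_extension = [(72, 600)]
n_cycles = 35

def total_hold_seconds(initial, cycle, cycles, final):
    """Sum the programmed hold times, ignoring temperature ramping."""
    per_cycle = sum(seconds for _, seconds in cycle)
    return (sum(seconds for _, seconds in initial)
            + cycles * per_cycle
            + sum(seconds for _, seconds in final))

total = total_hold_seconds(initial_denaturation, cycle_steps, n_cycles, final_extension)
print(total, "s =", total / 60, "min")  # 5400 s = 90.0 min
```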

Computational DNA Analysis

We performed computational analysis to detect aberrant splice

sites by using the NetGene2 Server30,31 and the splicing-enhancer

motif-prediction program ESEfinder Release 2.0,32 for which we

used the default parameters for the following SR proteins: SF2/

ASF, SC35, SRp40, and SRp55.

RNA Isolation, Semiquantitative Reverse Transcription PCR, and Quantitative Real Time PCR

We extracted cultured-cell-line and tumor RNA by using the

RNeasy Mini kits, and we treated it with DNase according to the

manufacturer’s (QIAGEN, Valencia, CA) protocol. For semiquanti-

tative reverse transcription PCR (RT-PCR), we reverse transcribed

a total of 1 μg of RNA per reaction by using first-strand cDNA

synthesis with random primers (Promega, Madison, WI). To eval-

uate the transcription level of the MTAP splice variants, we per-

formed RT-PCR by using combinations of isoform-specific primers

listed in Table S1. We electrophoretically separated and visualized

RT-PCR products on a 1.5% agarose gel by using ethidium

bromide. We excised the bands and cloned and inserted them

into the pCR4-TOPO vector (Invitrogen, Carlsbad, CA), and we

sequenced 20 independent clones from each of the isolated bands

to establish their identity. For quantitative RT-PCR (qRT-PCR), we

reverse transcribed a total of 1 μg of RNA per reaction by

using iSCRIPT cDNA Synthesis according to the manufacturer’s

(Bio-Rad, Hercules, CA) protocol. We performed qRT-PCR by using

iQ SYBR Green Supermix (Bio-Rad, Hercules, CA), according to the

manufacturer’s protocol, on an ABI PRISM 7900HT Sequence

Detection System (Applied Biosystems, Foster City, CA) and by

using the primers listed in Table S1. All values were normalized

to either GAPDH or HPRT levels. All experiments were done in

triplicate, and all cell-culture experiments were independently

validated at least three times.
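The text states only that qRT-PCR values were normalized to GAPDH or HPRT; it does not name the calculation. One common way to carry out such a normalization is the comparative Ct (2^-ΔΔCt) method, sketched below under the assumption of equal, near-100% amplification efficiency for target and reference. The function name and all Ct values are hypothetical.

```python
from statistics import mean

def relative_expression(ct_target, ct_reference, ct_target_ctrl, ct_reference_ctrl):
    """Comparative Ct estimate of target expression relative to a control sample.

    Each argument is a list of replicate Ct values (for example triplicates).
    Assumes a doubling of product per cycle for both amplicons.
    """
    delta_sample = mean(ct_target) - mean(ct_reference)          # normalize to reference gene
    delta_control = mean(ct_target_ctrl) - mean(ct_reference_ctrl)
    return 2.0 ** -(delta_sample - delta_control)

# Hypothetical triplicate Ct values for an MTAP isoform and GAPDH.
fold = relative_expression(
    ct_target=[25.0, 25.0, 25.0],
    ct_reference=[18.0, 18.0, 18.0],
    ct_target_ctrl=[24.0, 24.0, 24.0],
    ct_reference_ctrl=[18.0, 18.0, 18.0],
)
print(fold)  # 0.5, i.e. half the control level
```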


RACE: Rapid Amplification of cDNA 3′ Ends

Total RNA was isolated from five primary human fibroblast cell

lines. We used a total of 1 μg of RNA per cell line, the PowerScript

Reverse Transcriptase, and the BD SMART IIA primer from the BD

SMART RACE cDNA amplification kit (according to the manufac-

turer’s [Franklin Lakes, NJ] protocol) to prepare corresponding

first-strand cDNAs. In brief, each of the 3′ RACE-ready cDNAs

was used in PCR-amplification reactions with the SMART RACE

kit universal primers and sense gene-specific primers (GSP). First-

round PCR and second-round PCR were performed with the

specific primers listed in Table S1, Outer Primer (SMART RACE

kit), and Inner Primer (SMART RACE kit).

Cell Culture and Transfections

Patient-derived osteosarcoma, fibroblast, and lymphoblast cell

lines and other commercially available cell lines (obtained from

ATCC) were maintained in Dulbecco’s modified Eagle’s medium

(DMEM) supplemented with 10% fetal bovine serum, 100 U/ml

penicillin, and 100 μg/ml streptomycin and were grown at 37°C in 5% CO2. For the expression studies, we transfected the cells

24 hr after plating by using Lipofectamine 2000 according to

the manufacturer’s (Invitrogen, Carlsbad, CA) recommended

protocol.

Expression Constructs

GST and V5 Expression Vectors

To generate the vectors expressing each of the MTAP splice

variants (archetype MTAP and MTAP_v1, _v2, _v3, _v4, _v5,

and _v6), we used RT-PCR to amplify the cDNA of each splice

variant from normal fibroblasts by using the relevant primers

listed in Table S1. In brief, the exon 1 forward primer was used

for the amplification of all variants, and the reverse primers were

designed to delete the stop codon so that a fusion protein would

be generated with the V5 epitope. The amplified products were

cloned and inserted into the pcDNA3.1/V5-His TOPO TA expres-

sion vector (Invitrogen, Carlsbad, CA). The resulting clones were

completely sequenced in both orientations prior to their use.

Partial Minigene Expression Vectors

We used genomic DNAs obtained from a normal fibroblast cell line

and two patient-derived fibroblast cell lines to amplify the

following fragments: an 8 kb fragment containing exons 6–8

(flanked by introduced SalI and NheI restriction sites; subcloned

into the pcDNA3.1/V5-His TOPO TA vector), 2.1 kb fragments

containing exon 9 and carrying the wild-type (WT) DNA or either

the c.813-2A>G or c.885A>G mutation (flanked by introduced

SpeI and XhoI restriction sites; cloned into the pCR4-TOPO

vector), and a 3 kb fragment containing exons 10 and 11 (flanked

by introduced XhoI and SacII restriction sites; cloned into the

pCR4-TOPO vector). Each 2.1 kb fragment was digested with

SpeI and XhoI, and we cloned the fragments and inserted them

into a SpeI/XhoI-digested pcDNA3.1 (exons 6–8) vector to

generate the following three pcDNA3.1 (exons 6–8 and 9)

constructs: WT, c.813-2A>G, and c.885A>G. Finally, the 3 kb fragment

was digested with XhoI and SacII and cloned and inserted

into each one of the three pcDNA3.1 (exons 6–8 and 9) vectors di-

gested with XhoI and SacII. The following three pcDNA3.1 (exons

6–8 and 9–11) minigene expression vectors were created: WT,

c.813-2A>G, and c.885A>G. The primers that we used are listed

in Table S1. We performed PCR amplifications by using the

EXPAND Long Template PCR system according to the manufac-

turer’s (Roche, Indianapolis, IN) protocol. We sequenced all exons

and approximately 500 base pairs of the intron-exon flanking

boundaries of the generated plasmids.

Coimmunoprecipitation Assay

Relevant combinations of GST- and V5-tagged constructs for

archetype MTAP and MTAP_v1, _v2, _v3, _v4, _v5, and _v6 were

cotransfected into PC3M cells. All of the following procedures

were performed at 4°C. Cell extracts for immunoblotting were harvested

in NP40 lysis buffer (Santa Cruz Biotechnology, standard

protocol), and insoluble material was removed by centrifugation

(10 min at 13,000 rpm). One tenth of the cell extracts were

reserved for subsequent immunoblot analysis. Cell extracts were

incubated with V5 antibody (1 μg/ml) or GST antibody (1 μg/ml)

for 4 hr and Protein A-Sepharose (Invitrogen, Carlsbad, CA). Beads

were washed three times with 1% lysis buffer, and coimmunoprecipitates

were released by being boiled in 100 μl 2× SDS Reducing

Sample Buffer (Invitrogen, Carlsbad, CA) for 5 min. Coimmuno-

precipitates were analyzed by immunoblot analysis (see below).

Immunoblot and Densitometric Analysis

Cell extracts for immunoblotting were harvested in radioimmunoprecipitation

assay (RIPA) buffer (Santa Cruz Biotechnology,

standard protocol). Equal amounts of protein (50 μg) as determined

by the Bio-Rad DC Protein quantification assay were loaded

and separated by polyacrylamide gel electrophoresis and trans-

ferred to nitrocellulose membranes. We performed immunoblot-

ting by using a goat polyclonal antibody to actin (SC-1615), a

monoclonal (0.2 μg/ml) antibody to GST (BD PharMingen), and

a monoclonal antibody to the V5 tag (Santa Cruz Biotechnology).

We analyzed enhanced chemiluminescent images of immuno-

blots by using a scanning densitometer and quantifying the bands

(BIOQUANT NOVA imaging system). All values were normalized

to actin and expressed as fold changes relative to the control.
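The normalization described above reduces to two divisions per lane: band intensity over the actin intensity of the same lane, then that ratio over the ratio for the control lane. A minimal sketch follows; the function name, lane labels, and intensity values are hypothetical.

```python
def fold_changes(band, actin, control):
    """Normalize band intensities to actin, then express them relative to a control lane.

    band and actin map lane name -> densitometric intensity (arbitrary units).
    """
    normalized = {lane: band[lane] / actin[lane] for lane in band}
    reference = normalized[control]
    return {lane: value / reference for lane, value in normalized.items()}

# Hypothetical intensities for a control lane and one test lane.
band = {"control": 1000.0, "mutant": 450.0}
actin = {"control": 2000.0, "mutant": 1800.0}
print(fold_changes(band, actin, control="control"))
# {'control': 1.0, 'mutant': 0.5}
```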

MTA Quantification

Blood-serum and cell-extract lysates were mixed with 63 pmol

[5′-²H₃]MTA, neutralized with KOH, and centrifuged. MTA fractions

were purified by HPLC (SymmetryShield RP18 column),

concentrated, dissolved in 10% methanol with 0.1% TFA, and

subjected to LCQ ESI-MS analysis. MTA was quantitated with an

internal mass standard of [5′-²H₃]MTA (301 amu) relative to the peak

area for authentic MTA (298 amu). Samples were analyzed in

triplicate.
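Quantification against a spiked isotope-labeled standard is, in its simplest form, a peak-area ratio scaled by the known amount of standard. The sketch below assumes that labeled and unlabeled MTA give the same detector response and are recovered equally through purification; neither assumption is stated in the text. The function name and peak areas are hypothetical, and only the 63 pmol spike comes from the method above.

```python
from statistics import mean

SPIKE_PMOL = 63.0  # amount of labeled MTA added to each sample

def mta_pmol(area_298, area_301, spike_pmol=SPIKE_PMOL):
    """Estimate endogenous MTA (pmol) from the 298 amu / 301 amu peak-area ratio."""
    if area_301 <= 0:
        raise ValueError("internal-standard peak area must be positive")
    return spike_pmol * area_298 / area_301

# Hypothetical triplicate peak-area pairs (298 amu, 301 amu).
replicates = [(2.0e5, 1.0e5), (2.1e5, 1.0e5), (1.9e5, 1.0e5)]
estimates = [mta_pmol(a298, a301) for a298, a301 in replicates]
print(mean(estimates))
```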

MTAP Enzymatic Activity Assay

Cells were trypsinized and washed twice with PBS, and the cell

pellets were frozen at −80°C. On the day of the analysis, the pellets

were thawed on ice and resuspended in MTAP lysis buffer (20 mM

potassium phosphate, pH 7.4, 1 mM dithiothreitol, and Roche

complete, EDTA-free protease cocktail in a dilution equivalent to

1 tablet for 125 ml of lysate buffer), sonicated three times for

15 s each and cooled on ice between sonications, and centrifuged

at 13,000 × g for 15 min at 4°C. Protein concentrations were

measured with the Bio-Rad DC Protein Assay. The MTAP-activity

assay was as described33 and had the following modifications:

Cell lysates or an enzyme blank consisting of an equal volume

of lysis buffer was preincubated at 37°C for 5 min in 86 μl of a

solution containing 116 mM potassium phosphate, 58 mM KCl,

and 0.23 mM dithiothreitol in quartz microcuvettes. The enzyme

reactions were started with the addition of 4 μl xanthine oxidase

(0.15 units) and 10 μl of 5 mM MTA. Data were collected for


1,300 s, and the rates were calculated by a linear fit to the data

between 800 and 1,100 s. The blank rate was subtracted from all

the enzyme rates. MTAP-assay rates were linear; the lysate protein

concentration ranged from 10 to 500 ng protein per assay. One

unit of MTAP activity is the amount of enzyme that catalyzed

the formation of 1 μmole of adenine per minute under the conditions

of the assay.
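The rate calculation described above (a linear fit over the 800 to 1,100 s window, followed by blank subtraction) can be sketched as an ordinary least-squares slope. The code below works in raw signal units per second; converting to μmole of adenine per minute needs an extinction coefficient and path length that this section does not give, so that step is omitted. Function names and the synthetic traces are hypothetical.

```python
def window_slope(times, signal, t_min=800.0, t_max=1100.0):
    """Least-squares slope of signal versus time within [t_min, t_max] seconds."""
    points = [(t, s) for t, s in zip(times, signal) if t_min <= t <= t_max]
    if len(points) < 2:
        raise ValueError("need at least two points inside the fitting window")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_s = sum(s for _, s in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (s - mean_s) for t, s in points)
    return sxy / sxx

def blank_corrected_rate(times, sample, blank):
    """Sample slope minus enzyme-blank slope, in signal units per second."""
    return window_slope(times, sample) - window_slope(times, blank)

# Synthetic traces sampled every 50 s for 1,300 s (hypothetical values).
times = list(range(0, 1301, 50))
sample = [0.10 + 2.0e-4 * t for t in times]
blank = [0.10 + 5.0e-5 * t for t in times]
print(blank_corrected_rate(times, sample, blank))
```

Restricting the fit to a late window, as the method does, avoids the initial lag of a coupled assay; the window bounds are exposed as parameters so that they can be adjusted.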

Results

Positional Cloning of the DMS-MFH-Associated Gene

Previously, we performed genome-wide linkage analysis

and haplotype reconstruction on three unrelated families

to establish the critical region of the DMS-MFH-associated

gene as a 3.5 cM locus in chromosomal region 9p21–22.

The centromeric boundary was defined by D9S1778, and

the telomeric end was defined by D9S171.4 Since the

original mapping, we identified two additional families

(families 4 and 5) (Figure 1). Interestingly, both of the

new families had evidence of a progressive myopathic

disease (Martignetti et al., unpublished observations34)

not previously noted in the three original families. The

disease-associated gene for one of these later multigenera-

tional families (this family was originally described by

Henry et al.35 as having a history of bone disease, patho-

logic fractures, fibrosarcoma, cataracts, and myopathy

[MIM 609940]) had been independently mapped to a

15 Mb region at 9p21–22.34 This region broadly overlap-

ped the DMS-MFH critical region. To determine whether

these two additional families affected by myopathic

disease were allelic and supported the previously identified

DMS-MFH critical region, we performed multipoint

parametric linkage analysis by using markers spanning

chromosome 9.

We analyzed all five families by using a combination of

previously described and laboratory-developed microsatel-

lite markers. A maximal combined location score of 4.27

was obtained for marker D9SB3 (Figure 2A). We then per-

formed haplotype analysis of additional markers within

the shared region and reconstruction of haplotypes to

Figure 1. Pedigrees of the Five DMS-MFH-Affected Families

[Pedigree diagrams for families 1–5 not reproduced. Symbol key: tested/unaffected, bone dysplasia, bone tumor, suspected by history.]

Families 1, 2, and 3 were originally described as the ‘‘American,’’ ‘‘Australian,’’ and ‘‘New York’’ families, respectively.1–3 Family 4 is a previously undescribed DMS-MFH-affected family from New York. Family 5 has been described as having autosomal-dominant bone fragility and limb-girdle myopathy.34 Family members classified as ‘‘suspected by history’’ were unavailable for radiological diagnosis but had a history of multiple (>2) pathological fractures and/or had children who were clinically diagnosed with DMS-MFH on the basis of plain-film X-rays and family history.


further narrow the region. The minimal disease locus was

established by recombination events between markers

AL882 and D9S1749 in unaffected individual F3 IV-6 and

between markers D9S916 and D9S976 in affected indi-

vidual F3 IV-7. These events narrowed the critical region

to 1.2 Mb (Figure 2B).

DNA-sequence analysis of known candidate genes

within the original critical region had previously failed to

identify causative mutations.4 In particular, this analysis

included the known eight exons and corresponding

intron-exon boundaries of MTAP.4 Having exhausted all

known genes, we next sought and analyzed predicted

genes and putative open reading frames (ORF) from within

the region. In silico analysis of the region with the use of

the UCSC Genome Browser identified an ORF (GenBank

accession number AF216650) located 65 kb downstream

of the known MTAP termination site within a truncated

expressed sequence tag (EST). DNA-sequence analysis of

the putative 192 bp ORF, which we termed MTAP exon 9,

and its intron-exon boundaries revealed the presence of

one of two heterozygous A>G substitutions for all affected

members of the first four DMS-MFH-affected families

(Figure 2D). Using the EST clone (GenBank accession

number AK309365) as a reference, we determined that

one mutation was a synonymous change at position

c.885A>G (p.(=)), effectively R100R, and was present in

affected families 1, 3, and 4. The second mutation, c.813-

2A>G, was an intronic change present in affected family

Figure 2. Identification of the Critical Region and the DMS-MFH-Associated Gene
(A) The results of a combined multipoint parametric linkage analysis for families 2, 3, and 4 (axes: combined location score versus marker position in Haldane). The maximal location score is 4.27 at marker D9SB3.
(B) Haplotype analysis of informative individuals from families 2, 3, and 4 narrowed the DMS-MFH critical region (boxed).
(C) Physical map of the DMS-MFH critical region, which spans 1.3 Mb between the flanking markers AL882 and D9S976. Arrows indicate transcriptional direction of candidate genes. The exon-intron structure of MTAP highlights the eight exons of MTAP (green) and the three terminal exons (blue). Maps are not drawn to scale.
(D) Representative DNA-sequence chromatograms from selected affected and unaffected individuals from each family.
(E) Tumor DNA sequence revealing homozygosity of the diseased allele and loss of the unaffected allele.


2. The sequence changes segregated appropriately with

the disease phenotype within all respective family

members in each family. We then analyzed affected and

unaffected individuals from family 5; all affected individuals possessed the c.813-2A>G mutation. To test the possibility that these changes represented polymorphisms and

not pathogenic mutations, we screened a control popula-

tion. Neither mutation was identified in 1,000 chromo-

somes from 500 unaffected control individuals. Similarly,

the mutations were not present in dbSNP build 131.
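The strength of this control screen can be put in rough numbers. The following is a minimal sketch, assuming the 1,000 control chromosomes are independent draws from one population; the function name is hypothetical and the calculation is not part of the original analysis. It computes the largest allele frequency that would still be compatible, at 95% confidence, with observing no mutant allele.

```python
def upper_bound_allele_frequency(n_chromosomes: int, confidence: float = 0.95) -> float:
    """Largest allele frequency p compatible with seeing zero mutant alleles.

    Solves (1 - p) ** n = 1 - confidence for p, assuming the n chromosomes
    are independent samples (an assumption of this sketch).
    """
    if n_chromosomes <= 0:
        raise ValueError("n_chromosomes must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_chromosomes)


# 1,000 chromosomes from 500 control individuals, none carrying either change
bound = upper_bound_allele_frequency(1000)
print(f"95% upper bound on allele frequency: {bound:.4%}")
```

Under these assumptions the bound is close to the familiar "rule of three" approximation, 3/n, so an absence in 1,000 chromosomes argues against a common polymorphism but cannot by itself exclude a rare one.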

The DMS-MFH Mutation Is Homozygously Present

in a Patient-Derived Osteosarcoma

MFH is a rare, highly aggressive bone tumor of uncertain

histogenesis but whose histologic appearance, treatment,

and response are similar to those of osteosarcoma.36,37

Indeed, the diagnosis of MFH has been controversial, and

in cases where osteoid is present, the diagnosis of osteosar-

coma is favored.38 Approximately one-third of affected

individuals within our families developed bone sarcomas

arising between the second and fifth decades of life. The

diagnoses were either MFH3 or bone fibrosarcoma.1 As

shown in Figure 3, given the presence of osteoid, the histo-

pathological analysis of a tumor from a DMS-MFH-affected

individual (III-3 from family 4; c.885A>G) is consistent

with the diagnosis of osteosarcoma. Thus, inherited

MTAP alternative-splicing mutations can result in

histology-proven osteosarcoma.

Moreover, and in agreement with Knudson’s two-hit

hypothesis for a tumor-suppressing gene,39 direct

sequencing of this patient’s osteosarcoma genomic DNA

demonstrated homozygosity for the c.885A>G mutation

(Figure 2E). LOH analysis with microsatellite markers spanning the originally defined 2.9 Mb DMS-MFH critical

region revealed complete loss of the WT allele from the

unaffected chromosome (data not shown).

DMS-MFH Mutations Result in Exon Skipping and

Altered Expression of MTAP Isoforms

Given the identification of these disease-specific genomic

DNA mutations, we sought to understand the relationship

between the previously uncharacterized ninth exon and

the MTAP RNA transcripts. Therefore, we performed 3′ RACE on the total RNA isolated from control and patient-derived fibroblast and lymphoblast cell lines and also from a patient-derived tumor cell line that we established. For 3′ RACE, we used intron-spanning primers anchored

in exons 5 and 6 of the WT gene and sequenced the result-

ing cDNA products. Using this approach, we identified, as

expected, the archetype MTAP transcript and, in accord

with our hypothesis, additional isoforms containing exon

9. In total, six additional isoforms were identified; none

contained the WT terminal exon 8, and all affected the C

terminus of the protein product in different ways

(Figure S1). Four contained either a short (9S; 103 nt) or

long (9L; 192 nt) form of exon 9. Additionally, four of the

isoforms contained a unique sequence that was mappable

to two additional downstream exons, 10 and 11 (Figure 2C

and Figure S1). On the basis of the electrophoretic mobility

of the MTAP transcripts generated, we named the six alter-

native splice variants MTAP_v1 (exons 1–7 and 9S–11),

_v2 (exons 1–7 and 9L), _v3 (exons 1–7, 10, and 11), _v4

(exons 1–6 and 9S–11), _v5 (exons 1–6 and 9L), and _v6

(exons 1–6, 10, and 11). Splice variants 1–3 contained the

WT exon 7 sequence; variants 4–6 did not.
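The exon composition of the seven transcripts can be written down as a small lookup table, which makes the groupings used in the following sections easy to verify. This is an illustrative sketch only: the dictionary and helper names are hypothetical, and the exon lists simply transcribe the compositions given above (with ‘‘9S–11’’ read as exons 9S, 10, and 11).

```python
def exon_range(first: int, last: int) -> list[str]:
    """Return consecutive exon labels, e.g. exon_range(1, 3) -> ['1', '2', '3']."""
    return [str(i) for i in range(first, last + 1)]


# Exon composition transcribed from the text; wtMTAP is the archetype transcript.
ISOFORM_EXONS = {
    "wtMTAP": exon_range(1, 8),
    "MTAP_v1": exon_range(1, 7) + ["9S", "10", "11"],
    "MTAP_v2": exon_range(1, 7) + ["9L"],
    "MTAP_v3": exon_range(1, 7) + ["10", "11"],
    "MTAP_v4": exon_range(1, 6) + ["9S", "10", "11"],
    "MTAP_v5": exon_range(1, 6) + ["9L"],
    "MTAP_v6": exon_range(1, 6) + ["10", "11"],
}


def contains_exon9(isoform: str) -> bool:
    """True if the isoform carries either the short (9S) or long (9L) exon 9."""
    return any(exon in ("9S", "9L") for exon in ISOFORM_EXONS[isoform])


with_exon9 = sorted(name for name in ISOFORM_EXONS if contains_exon9(name))
without_exon7 = sorted(name for name, exons in ISOFORM_EXONS.items() if "7" not in exons)

print("Exon-9-containing isoforms:", with_exon9)
print("Isoforms lacking exon 7:", without_exon7)
```

The two derived lists correspond to the groups discussed in the text: v1, v2, v4, and v5 contain exon 9, and v4, v5, and v6 lack exon 7.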

The two DMS-MFH mutations, one intronic and the

other exonic, did not predict amino acid changes. We

hypothesized that one possible pathogenic mechanism

could be through an effect on alternative splicing. We

analyzed the DNA sequence for the presence of known

Figure 3. Histopathological Analysis of an Osteosarcoma from Patient F4 IV-2
(A) Histologic analysis revealed that 95% of the studied tumor specimen displayed the typical pattern of malignant spindle (fibroblastic) cells of bone MFH.
(B) Malignant cells forming neoplastic bone.
(C) Focal sheets of neoplastic bone within the tumor are shown to be entrapping pre-existing bone trabeculae, and the overtly malignant cells are shown to produce bone. All sections were stained with hematoxylin and eosin.


donor and acceptor splice motifs and intronic and exonic

cis elements that could direct splice-site identification.

The intronic c.813-2A>G mutation was predicted to result

in the loss of a canonical splice acceptor site.31 The

c.885A>G transition abolished a predicted exonic splicing

enhancer (ESE) sequence.32
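The effect of the intronic change on the acceptor site can be illustrated with a very small check. This is a sketch under stated assumptions, not the prediction tool used in the study: it only tests for the canonical AG dinucleotide at the 3′ end of an intron, the helper names are hypothetical, and the sequences CTTTAG and CTTTGG are the labels shown for c.813-2 in Figure 4A, read here as the last six nucleotides of the intron.

```python
def substitute(seq: str, index: int, ref: str, alt: str) -> str:
    """Return seq with the base at index changed from ref to alt.

    Negative indices count from the 3' end, so index -2 is the
    second-to-last intronic nucleotide (position -2 relative to the exon).
    """
    i = index if index >= 0 else len(seq) + index
    if not 0 <= i < len(seq):
        raise IndexError("position lies outside the sequence")
    if seq[i] != ref:
        raise ValueError(f"expected {ref} at position {index}, found {seq[i]}")
    return seq[:i] + alt + seq[i + 1:]


def has_canonical_acceptor(intron_tail: str) -> bool:
    """True if the intron ends with the canonical AG acceptor dinucleotide."""
    return intron_tail.upper().endswith("AG")


# Assumed to be the final six intronic nucleotides upstream of exon 9
WT_INTRON_TAIL = "CTTTAG"
mutant_tail = substitute(WT_INTRON_TAIL, -2, "A", "G")

print(WT_INTRON_TAIL, has_canonical_acceptor(WT_INTRON_TAIL))
print(mutant_tail, has_canonical_acceptor(mutant_tail))
```

A dinucleotide test of this kind captures only the most basic requirement of an acceptor site; dedicated predictors also weigh the polypyrimidine tract and branch point.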

To validate these in silico findings, we generated

three MTAP minigene constructs, WT, c.813-2A>G, and

c.885A>G, containing ~11.5 kb of genomic sequence

and differing at only the single nucleotide position being

interrogated (Figure 4A). The minigene constructs were

transiently transfected into MCF7 cells, and the relative

expression levels of each isoform were determined by

qRT-PCR. The identity of each transcript was established

by RT-PCR, subcloning, and the sequencing of at least 20

independent clones from gel-isolated bands.

The three different MTAP constructs revealed clear differences in the expression pattern of splice variants, whereas

Figure 4. Patient-Derived MTAP Mutations Result in Exon Skipping
(A) Sequence differences of the three minigene constructs, WT, c.813-2A>G, and c.885A>G, are highlighted. [Panel labels: c.813-2, CTTTAG (WT) and CTTTGG (mutant); c.885, ACAGAGGA (WT) and ACAGGGGA (mutant).]
(B) Schematic representation of the major sequence-verified isoforms flanking the electrophoretic profile of each minigene construct. Arrowheads above exons depict the positions of translational stop codons. The WT construct expresses all alternative splice variants that are detectable with this combination of primers (exon 6 forward [Ex6F] and exon 11 reverse [Ex11R]) (left), whereas patient-derived mutant constructs, c.813-2A>G and c.885A>G, expressed relatively very low levels of exon-9S- and -9L-containing variants, v1/v4 and v2/v5, respectively. By comparison, the expression of isoforms v3 and v6 was markedly elevated in both mutant constructs.
(C) qRT-PCR analysis of archetype MTAP and isoform expression in cells that express each of the minigene constructs. Expression analysis of the three minigene constructs demonstrated that the c.813-2A>G mutant construct resulted in significantly increased archetype MTAP expression levels, whereas both the c.813-2A>G and c.885A>G mutant constructs were associated with an absence of and/or significantly decreased levels of MTAP_v1, _v2, _v4, and _v5. Both mutant constructs resulted in significantly increased expression levels of MTAP_v3 and _v6. The following abbreviation is used: wtMTAP, archetype MTAP. The error bars represent the averages of three independent experiments.


expression of the WT form was unchanged (Figure 4B).

Although the WT gene construct directed expression of all

seven MTAP transcripts, the single-nucleotide changes at

c.813-2A>G and c.885A>G resulted in markedly decreased

expression of the four exon-9-containing transcripts,

namely MTAP_v1, _v2, _v4, and _v5. In contrast, there

was a significant increase in the two isoforms lacking exon

9, namely MTAP_v3 and _v6. Differences were noted in

the transcription effects of the two mutations. Quantifica-

tion of the resultant isoforms by qRT-PCR demonstrated

that the c.813-2A>G mutation ablated expression of all

isoforms containing exon 9 and significantly increased the

expression of archetype MTAP (Figure 4C). In addition, the

first three amino acids were lacking from both 9S isoforms.

The c.885A>G mutation decreased the expression levels

of all exon 9 isoforms by approximately 70% but had no

effect on archetype-MTAP expression. Both patient-derived

mutations significantly increased the expression of

MTAP_v3 and _v6 to the same degree (Figure 4C).
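This section does not state how relative expression was computed from the qRT-PCR data. One widely used approach is the comparative Ct (2^-ΔΔCt) method, and the sketch below illustrates it purely as an assumption. The Ct values, the function name, and the choice of reference gene are hypothetical, and the formula presumes perfect amplification efficiency.

```python
def fold_change(ct_target_sample: float, ct_reference_sample: float,
                ct_target_control: float, ct_reference_control: float) -> float:
    """Relative expression by the comparative Ct method (2 ** -ddCt).

    Each target Ct is first normalized to a reference gene measured in the
    same sample; the sample is then compared with the control condition.
    """
    delta_sample = ct_target_sample - ct_reference_sample
    delta_control = ct_target_control - ct_reference_control
    return 2.0 ** -(delta_sample - delta_control)


# Hypothetical Ct values: an isoform measured in mutant and WT minigene cells
mutant_vs_wt = fold_change(
    ct_target_sample=26.0, ct_reference_sample=18.0,    # mutant construct
    ct_target_control=24.0, ct_reference_control=18.0,  # WT construct
)
print(f"fold change relative to WT: {mutant_vs_wt:.2f}")
```

With these made-up inputs the target crosses threshold two cycles later in the mutant sample, which corresponds to a quarter of the WT level. A decrease of approximately 70%, as reported for the exon 9 isoforms with c.885A>G, would correspond to a fold change of about 0.3.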

The mutation-dependent isoform expression pattern

that we identified by using the engineered minigene

expression constructs was also present in patient-derived

tissues. As shown in Figure 5, patient-derived fibroblast

and lymphoblast cell lines, constitutively heterozygous

for the mutations, demonstrated MTAP isoform expression

patterns similar to each other but markedly different from

the expression patterns of control cells (Figures 5A and 5B).

Quantitative RT-PCR analysis revealed that in patient-

derived cells, expression levels of MTAP_v1 and expression

levels of MTAP_v3 and _v6 were approximately 50% lower

and nearly five to ten times higher, respectively, than those

of the controls (Figures 5A and 5B). In both these cell-line

types, we were unable to consistently detect quantifiable

levels of MTAP_v2, _v4, and _v5 (results not shown).

MTAP Splice Variants Can Physically Interact with

Archetype MTAP and Are Biologically Active

We began exploring the function of the different MTAP

splice variants by first determining their respective protein

stabilities. The transient transfection of expression vectors

(which contained each isoform fused with a 3′ V5 and/or
a 5′ GST terminal tag) into MCF7 cells demonstrated that

Figure 5. Dysregulated MTAP Splicing Patterns in Patient-Derived Cell Lines
MTAP expression patterns in patient-derived and control fibroblast (A) and lymphoblast (B) cell lines. Semiquantitative RT-PCR analysis is on the left, and qRT-PCR analysis of archetype MTAP is on the right.
(A) C1, C2, C3, and C4 are controls. Fibroblast and tumor cell lines are from affected individual F4 III-3.
(B) C1 and C2 are controls. Lymphoblast cell lines are from individuals F3 IV-5, F3 IV-6, F3 IV-7, F4 III-5, and F4 III-6. The following abbreviation is used: wtMTAP, archetype MTAP.


all six MTAP splice variants were translated. Although RNA

levels were essentially equivalent (data not shown), there

was, however, a large variability in protein expression

levels (Figure S2). Archetype-MTAP and MTAP_v1, _v2,

and _v3 proteins (i.e., those containing exon 7) were

expressed at comparable levels, whereas MTAP_v4, _v5,

and _v6 were expressed at markedly lower levels.

These findings suggest a possible difference in protein

stability that might be regulated through the proteasome

pathway. To test this hypothesis, we treated the transfected

cells with the proteasome inhibitor MG132. The v4, v5,

and v6 isoforms accumulated, whereas archetype MTAP

and variants v1, v2, and v3 were relatively unaffected

(Figure S2A). The half-life of each splice variant was estimated by cycloheximide treatment. The half-life of archetype MTAP and MTAP_v1, _v2, and _v3 was ≥12 hr. It was

significantly less than 6 hr for MTAP_v4, _v5, and _v6

(Figure S2B).
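How a half-life can be derived from a cycloheximide chase is sketched below. The section does not describe the authors' calculation, so this is only one common approach, assuming first-order decay of the protein once translation is blocked. The time points and band intensities are hypothetical, and the function name is illustrative.

```python
import math


def estimate_half_life(times_hr: list[float], fraction_remaining: list[float]) -> float:
    """Estimate a half-life (hr) from a translation-shutoff time course.

    Assumes first-order decay, so ln(fraction remaining) falls linearly
    with time; the slope of an ordinary least-squares fit gives -k, and
    the half-life is ln(2) / k.
    """
    if len(times_hr) != len(fraction_remaining) or len(times_hr) < 2:
        raise ValueError("need at least two matched time points")
    if any(f <= 0 for f in fraction_remaining):
        raise ValueError("fractions must be positive to take logarithms")
    logs = [math.log(f) for f in fraction_remaining]
    n = len(times_hr)
    mean_t = sum(times_hr) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_hr)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_hr, logs))
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("no decay detected over the time course")
    return math.log(2) / -slope


# Hypothetical densitometry values, normalized to the 0 hr band
chase_times = [0.0, 4.0, 8.0, 12.0]
band_fraction = [1.0, 0.5, 0.25, 0.125]
half_life = estimate_half_life(chase_times, band_fraction)
print(f"estimated half-life: {half_life:.1f} hr")
```

For a stable protein whose signal barely declines during the chase, a fit of this kind is poorly constrained, which is consistent with reporting such values only as a lower bound.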

Given these findings and the knowledge that MTAP

exists as a trimeric40–42 protein complex, we tested the

ability of the MTAP splice variants to physically interact

with archetype MTAP. We performed coimmunoprecipitations on cell extracts from cotransfected PC3M cells by using combinations of V5- and GST-tagged fusion proteins.

All MTAP splice variants were able to physically interact

with archetype MTAP (Figure S3).

We next sought to determine whether MTAP splice vari-

ants have MTAP activity. We directly measured the ability

of each isoform to convert MTA to adenine by using a

biochemical assay after we stably transfected them into

two MTAP-null cell lines, MCF7 and MNNG-HOS (a

human-osteosarcoma-derived cell line). Cellular lysates

from mock-transfected cells demonstrated negligible

MTAP activity. Against this null-activity background,

only MTAP isoforms v1, v2, and v3 demonstrated MTAP

activity (Figure S4). Given that variants v4, v5, and v6

had appreciably shorter half-lives, we also performed these

studies in the presence of MG132. Despite the presence of

these isoforms, as determined by immunoblot analysis, we

were unable to detect MTAP activity within the limits of

our assay system (data not shown).

MTA Serum Levels Are Increased in DMS-MFH

MTA is not normally present in human serum. Cells lack-

ing MTAP activity are unable to metabolize MTA,43 and

functional inhibition44 or dysregulation of MTAP activity

would therefore be expected to result in intracellular MTA

accumulation and secretion. When tumor cells are MTAP

deficient, excess MTA would be expected to be cleared by

surrounding MTAP-normal stromal cells. If MTAP expres-

sion or activity is globally affected in all tissues, for

example, in an individual with an inherited germline

MTAP deficiency, then serum levels of MTA should be

increased. To establish whether decreased expression of

exon-9-containing isoforms affects MTAP activity, we

measured MTA serum levels in DMS-MFH-affected family

members and controls. Serum samples from two affected

adults (F4 III-1 and F4 IV-1), one unaffected family member

(F4 IV-3), and two unrelated controls were analyzed in a

blinded fashion. All three serum samples from unaffected

individuals had no detectable MTA levels. In marked

contrast, both affected individuals had accumulations of

MTA detectable in their serum (Table 1).

Molecular Modeling of MTAP Isoforms

The high-resolution structure of human archetype MTAP

has been previously determined by X-ray crystallography

with several ligands. Most relevant is the trimeric enzyme

complex including 5′-deoxy-5′-methylthioadenosine sulfate (PDB ID 1CG6) and the MTAP apoprotein (PDB ID 1CB0).36 These structures of human 5′-deoxy-5′-methylthioadenosine phosphorylase at 1.7 Å resolution provide

insights into substrate binding (Figure 6A) and catalysis

and serve as a template for modeling potential interactions

among the subunits. On the basis of this model, the amino

acid sequences providing the substrate binding site are en-

coded primarily by exons 6 and 7. Only the L279 residue is

provided by exon 8 (Figure S5). The structure reveals that

each of the three identical subunits is comprised of an

alpha-beta domain containing an eight-stranded and a

five-stranded mixed beta sheet with six dispersed alpha

helices similar to the family of purine nucleoside phos-

phorylases (Figure 6B). On the basis of the results of our

coimmunoprecipitation experiments, the MTAP trimer

could exist as a heterologous assembly of different subunits

comprised of archetype and splice variants. In the MTAP

splice variants, the 75 amino acids of v1 are inserted after

K271, and those of v4 are inserted after A230. In the

MTAP monomer, v1–v3 and v4–v6 oppose each other,

but most importantly, at the interfaces of different sub-

units of the trimer, all of the splice variants are relatively

close to each other spatially, regardless of in which exon

amino acids are inserted within the symmetrically equivalent subunits. As seen in Figure 6C, exon 6 is in juxtaposition with exon 7 of the adjacent subunit. The alternative splicing events do appear to affect the trimeric subunit interface of MTAP or, possibly, the MTA active site (Figure 6C). Secondary-structure and disulfide-bond

prediction analyses predict the generation of a disulfide

bond between cysteine residues in exons 9 and 10; these

prediction analyses require future biochemical analysis

for validation.

Table 1. MTA Serum Levels

Serum Donor    MTA Level (pmol/100 µl)
F4 III-1       11.5
F4 IV-2        4.3
F4 IV-3        not detected
Control 1      not detected
Control 2      not detected


Ancestral Retroposition Events Gave Rise to Specific

MTAP Exons in Humans and Possibly Other

Anthropoid Primates

Sequence analysis of the three terminal exons revealed that

exons 9 and 10 shared high homology with different

primate-specific retroviral sequences, which are known to

have integrated multiple times into different chromosomes

throughout the genome. In general, human endogenous

retrovirus (HERV) sequences are the result of ancestral infec-

tion events that can become incorporated into the genome

and transmitted vertically through speciation events.45 As

shown in Figure 7A, an analysis (with the use of programs

RepeatMasker and Retrosearch) of exons 9 and 10 suggested

that these exons arose from integrated retroviruses from

distinct subfamilies. Exon 9 arose from part of a MER50I

element, and exon 10 arose from part of a THE1A element,

one of several families of primate-specific long terminal

repeat (LTR) retrotransposons.46 Both the c.813-2A>G

and c.885A>G mutations in exon 9 represent G>A nucleotide transitions from the consensus MER50I sequence.

To more precisely establish the evolutionary age of exon

9, which contains the DMS-MFH mutations, we amplified

and sequenced this exon and its intron-exon boundaries

from a panel of primate genomic DNA. PCR amplicons

were obtained and sequenced for confirmation in great

apes and Old and New World monkeys. No product was

amplified from the ring-tailed lemur, a member of an

ancestral extant primate lineage (Figure 7). From this

analysis, we could determine that the MER50I remnant,

which now encodes MTAP exon 9, was integrated over

40 million years ago into the lineage leading to anthropoid

primates.

Discussion

Hereditary cancer syndromes represent a powerful and

tractable biologic system for identifying cancer-causing

mutations.47 Although the syndromes themselves are

rare, their study can provide insight into the basis of the

sporadic forms of the cancers. DMS-MFH represents the

only known hereditary form of MFH, which has been

thought to exist along a spectrum of bone sarcomas with

osteosarcoma. Osteosarcoma itself has been linked to

several hereditary disorders, including Li-Fraumeni

syndrome (MIM 151623), Rothmund-Thomson syndrome

(MIM 268400), Bloom syndrome (MIM 210900), Werner

syndrome (MIM 277700), Paget disease (MIM 602080),

and retinoblastoma, albeit the incidence of cancer devel-

opment associated with these syndromes is lower when

compared to the >30% associated with DMS-MFH.

Recently, one of our affected individuals developed histo-

logically proven osteosarcoma (Figure 3), thus further sup-

porting a genetic link between these tumor types.

The 9p21 region containing the MTAP locus is one of the

most frequently deleted and/or translocated chromosomal

regions in human cancer.5 The facts that MTAP is more

complex than previously recognized and that its terminal

coding exon lies within 25 kb of the p15/p16 locus have

immediate significance to LOH mapping and copy number

variation (CNV) studies in human cancer. Deletions

including the p15/p16 locus will more than likely also

include the 3′ region of MTAP and therefore might

affect MTAP biochemical activity. Thus, the interpretation

of many of these studies with regard to the genes

being affected should be reevaluated. Ultimately, the

Figure 6. Molecular Modeling of MTAP
Modeling is based on the coordinates of the trimeric enzyme complex including 5′-deoxy-5′-methylthioadenosine sulfate (PDB ID 1CG6) and the MTAP apoprotein (PDB ID 1CB0).42 Rendering was done with the program O v.13 (courtesy of Dr. Alwyn Jones, University of Uppsala, Uppsala, Sweden), and visualization was done with the PyMOL molecular graphics system v.1.3 (DeLano Scientific).
(A) Overview of the top of the MTAP trimer. The stick representation (arrow) indicates the MTA substrate.
(B) View of an MTAP monomer. The insertion positions of exon 7 (salmon) and exon 6 (tan) are at opposite ends.
(C) View of the bottom of the MTAP trimer. Subunits and the insertion positions of exon 7 (salmon) and exon 6 (tan; discontinuous electron density) are indicated. Displayed is the close proximity of one of these junctions: exon 7 in subunit 2 with exon 8 in subunit 3. Because of the three-fold symmetry, there is the possibility of three splice-variant insertion points in the trimer.


identification of the MTAP splice variants and the fact that

their genetic loss results in DMS-MFH provide a possible

in vivo demonstration that MTAP can act as a tumor

suppressor. Given our findings that the DMS-MFH muta-

tions also result in overexpression of two splice variants,

MTAP_v3 and _v6, the possibility that at least two of the

MTAP isoforms could represent oncogenic variants must

also be considered at this time.

Of possible related significance, the 9p21 chromosomal

region containing the full-length MTAP has been linked

to coronary artery disease (CAD [MIM 611139]) and

myocardial infarction in three independent genome-wide

association studies (GWASs).48–50 Because of a paucity of

known transcripts in the linkage-disequilibrium block,

the region has been viewed as a ‘‘gene desert.’’ MTAP might

represent an intriguing heart-disease candidate gene for

several reasons. First, in DMS-MFH-affected family 1, two

male family members died of heart disease in their early

forties without other known risk factors. A third family

member has been recently diagnosed with early CAD

(J.A.M., unpublished data). As such, CAD might represent

a previously unrecognized aspect of the disease phenotype

in this syndrome. Second, two of our families have been

diagnosed with myopathic disease and have features over-

lapping the symptoms of facioscapulohumeral muscular

dystrophy and limb-girdle muscular dystrophy. Each of

these chronic myopathic disorders is associated with an

increased risk of heart disease.51,52 Third, defects in poly-

amine metabolism have been associated with defects in

angiogenesis53 and altered myocyte function,54,55 whereas

a nearly pathognomonic feature of DMS-MFH bone

dysplasia is the presence of scattered infarctions through-

out the medullary cavity.4 Finally, the CAD risk alleles

identified in the original GWASs have been shown to

localize to a STAT1-dependent enhancer element that can

interact with MTAP and affect its transcription.56 Future

studies will be required for exploring the possible role(s)

of MTAP in CAD.

An intriguing evolutionary aspect of these studies is the

origin of the three terminal MTAP exons. Mammalian

chromosomes are interspersed with remnants of ancient

retroviral integration events. It is estimated that 8% of

the human genome consists of retroelements containing

LTRs that flank regions corresponding to the gag, pol, and

env genes.57 Although they usually infect somatic cells,

retroviruses can also infect germ cells and establish perma-

nent residence and future vertical transmission through

a species in a Mendelian fashion.46 Indeed, the majority

of HERVs are present in apes and Old World monkeys,

suggesting that their original integrations took place

more than 25 million years ago,58 and evidence now

exists that some HERV proteins have been co-opted into

functional roles.59 Two of the clearest examples are

provided by the expression of two fusogenic proteins by

Figure 7. Evolution of MTAP Exon 9
(A) Schematic representation demonstrating that exons 9 and 10 arose from distinct families of retroviral integration events.
(B) The PCR results of exon 9 in a primate genomic DNA panel are superimposed adjacent to a phylogenetic tree. The results demonstrate that exon 9 was integrated into the primate genome at some point in evolution between the divergence of the ring-tailed lemur and the common woolly monkey approximately 40 million years ago.


trophoblasts. Syncytin 1, derived from a HERV-W envelope glycoprotein, is expressed in all trophoblastic cells and mediates trophoblast cell fusion into the multinucleated syncytiotrophoblast layer. Syncytin 2, derived

from a HERV-FRD envelope glycoprotein, is expressed in

human placenta.60–62 Our studies reveal that two of the

three terminal exons of MTAP are derived from indepen-

dent retroviral integration events during primate evolu-

tion. Although the integration of exon-9-associated

material most likely occurred after prosimian and New

World monkey divergence more than 40 million years

ago (Figure 7), the timing of its being functionally co-opted

is presently unknown.

The retroviral origins of the terminal exons of MTAP

suggest a number of interesting a priori conclusions. First,

given the evolutionary species restriction of these HERVs,

the existence of MTAP variants and possible biochemical

regulation resulting from their expression must be unique

to primates. Second, the exon 9 mutations that result in

DMS-MFH highlight not only the existence of MTAP

isoforms but also that their heterozygous loss (v1, v2, v4,

and v5) and/or overexpression (v3 and v6) results in

disease. This functionally demonstrates the importance

of the acquired domains in MTAP function and their fixa-

tion into normal human physiology. This phenomenon,

the recruitment and fixation over evolutionary time of a

nucleic-acid sequence as a functional gene, or part thereof,

is termed exaptation, and a number of examples are

known.63–65 However, we are unaware of any other

example wherein the loss of a co-opted gene and/or

protein domain results in a disease phenotype.

Finally, our results continue to demonstrate the dynamic

nature of the genome and the dual reality of disease asso-

ciations and beneficial implications for both alternative

splicing and retroelements. Future studies will now be

required for defining the exact biochemical function(s)

and regulation of these isoforms, their association with

bone dysplasia, cancer initiation, and CAD, and their inter-

action with archetype MTAP.

Supplemental Data

Supplemental Data include five figures and one table and can be

found with this article online at http://www.cell.com/AJHG.

Acknowledgments

The authors gratefully acknowledge the families that participated

in this study. We also thank D. Springfield and J. Brosius for their

thoughtful comments about the patients and manuscript, respec-

tively, and X. Xu, J. Solis, J. Moorjani, C. Meret, S. Tam, and

N. Kham for technical assistance.

Received: October 14, 2011

Revised: January 19, 2012

Accepted: February 16, 2012

Published online: March 29, 2012

Web Resources

The URLs for data presented herein are as follows:

ESEfinder Release 3.0, http://rulai.cshl.edu/cgi-bin/tools/ESE3/

esefinder.cgi

Marshfield database, http://research.marshfieldclinic.org/genetics/

GeneticResearch/compMaps.asp

NetGene2 Server, http://www.cbs.dtu.dk/services/NetGene2/

Online Mendelian Inheritance in Man (OMIM), http://www.

omim.org

RepeatMasker, http://www.repeatmasker.org/

Retrosearch, http://www.daimi.au.dk/~biopv/herv/

UCSC Genome Browser, March 2006 assembly, http://genome.

ucsc.edu/

Accession Numbers

The EMBL-Bank accession numbers for the translated protein

MTAP variants reported in this paper are: HE654772 (MTAP_v1),

HE654773 (MTAP_v2), HE654774 (MTAP_v3), HE654775

(MTAP_v4), HE654776 (MTAP_v5), and HE654777 (MTAP_v6).

References

1. Arnold, W.H. (1973). Hereditary bone dysplasia with sarcoma-

tous degeneration. Study of a family. Ann. Intern. Med. 78,

902–906.

2. Hardcastle, P., Nade, S., and Arnold, W. (1986). Hereditary

bone dysplasia with malignant change. Report of three fami-

lies. J. Bone Joint Surg. Am. 68, 1079–1089.

3. Norton, K.I., Wagreich, J.M., Granowetter, L., and Martignetti,

J.A. (1996). Diaphyseal medullary stenosis (sclerosis) with

bone malignancy (malignant fibrous histiocytoma): Hardcas-

tle syndrome. Pediatr. Radiol. 26, 675–677.

4. Martignetti, J.A., Desnick, R.J., Aliprandis, E., Norton, K.I.,

Hardcastle, P., Nade, S., and Gelb, B.D. (1999). Diaphyseal

medullary stenosis with malignant fibrous histiocytoma: A

hereditary bone dysplasia/cancer syndrome maps to 9p21-

22. Am. J. Hum. Genet. 64, 801–807.

5. Mitelman, F. (1994). Catalog of Chromosome Aberrations in

Cancer (New York: Wiley/Liss).

6. Miyakoshi, J., Dobler, K.D., Allalunis-Turner, J., McKean, J.D.,

Petruk, K., Allen, P.B., Aronyk, K.N., Weir, B., Huyser-Wier-

enga, D., Fulton, D., et al. (1990). Absence of IFNA and IFNB

genes from human malignant glioma cell lines and lack of

correlation with cellular sensitivity to interferons. Cancer

Res. 50, 278–283.

7. Olopade, O.I., Jenkins, R.B., Ransom, D.T., Malik, K., Pomy-

kala, H., Nobori, T., Cowan, J.M., Rowley, J.D., and Diaz,

M.O. (1992). Molecular analysis of deletions of the short

arm of chromosome 9 in human gliomas. Cancer Res. 52,

2523–2529.

8. Fountain, J.W., Karayiorgou, M., Ernstoff, M.S., Kirkwood,

J.M., Vlock, D.R., Titus-Ernstoff, L., Bouchard, B., Vijayasar-

adhi, S., Houghton, A.N., Lahti, J., et al. (1992). Homozygous

deletions within human chromosome band 9p21 in mela-

noma. Proc. Natl. Acad. Sci. USA 89, 10557–10561.

9. Lukeis, R., Irving, L., Garson, M., and Hasthorpe, S. (1990).

Cytogenetics of non-small cell lung cancer: Analysis of consis-

tent non-random abnormalities. Genes Chromosomes Cancer

2, 116–124.


10. Diaz, M.O., Ziemin, S., Le Beau, M.M., Pitha, P., Smith, S.D.,

Chilcote, R.R., and Rowley, J.D. (1988). Homozygous deletion

of the alpha- and beta 1-interferon genes in human leukemia

and derived cell lines. Proc. Natl. Acad. Sci. USA 85, 5259–

5263.

11. Diaz, M.O., Rubin, C.M., Harden, A., Ziemin, S., Larson, R.A.,

Le Beau, M.M., and Rowley, J.D. (1990). Deletions of inter-

feron genes in acute lymphoblastic leukemia. N. Engl. J.

Med. 322, 77–82.

12. García-Castellano, J.M., Villanueva, A., Healey, J.H., Sowers,

R., Cordon-Cardo, C., Huvos, A., Bertino, J.R., Meyers, P.,

and Gorlick, R. (2002). Methylthioadenosine phosphorylase

gene deletions are common in osteosarcoma. Clin. Cancer

Res. 8, 782–787.

13. Miyazaki, S., Nishioka, J., Shiraishi, T., Matsumine, A., Uchida,

A., and Nobori, T. (2007). Methylthioadenosine phosphory-

lase deficiency in Japanese osteosarcoma patients. Int. J. On-

col. 31, 1069–1076.

14. Martignetti, J.A., Gelb, B.D., Pierce, H., Picci, P., and Desnick,

R.J. (2000). Malignant fibrous histiocytoma: Inherited and

sporadic forms have loss of heterozygosity at chromosome

bands 9p21-22-evidence for a common genetic defect. Genes

Chromosomes Cancer 27, 191–195.

15. Nobori, T., Takabayashi, K., Tran, P., Orvis, L., Batova, A., Yu,

A.L., and Carson, D.A. (1996). Genomic cloning of methyl-

thioadenosine phosphorylase: A purine metabolic enzyme

deficient in multiple different cancers. Proc. Natl. Acad. Sci.

USA 93, 6203–6208.

16. Kamatani, N., Nelson-Rees, W.A., and Carson, D.A. (1981).

Selective killing of human malignant cell lines deficient in

methylthioadenosine phosphorylase, a purine metabolic

enzyme. Proc. Natl. Acad. Sci. USA 78, 1219–1223.

17. Trackman, P.C., and Abeles, R.H. (1981). The metabolism of

1-phospho-5-methylthioribose. Biochem. Biophys. Res. Com-

mun. 103, 1238–1244.

18. Trackman, P.C., and Abeles, R.H. (1983). Methionine

synthesis from 5′-S-Methylthioadenosine. Resolution of

enzyme activities and identification of 1-phospho-5-S methyl-

thioribulose. J. Biol. Chem. 258, 6717–6720.

19. Kamatani, N., Yu, A.L., and Carson, D.A. (1982). Deficiency of

methylthioadenosine phosphorylase in human leukemic cells

in vivo. Blood 60, 1387–1391.

20. Schmid, M., Malicki, D., Nobori, T., Rosenbach, M.D., Camp-

bell, K., Carson, D.A., and Carrera, C.J. (1998). Homozygous

deletions of methylthioadenosine phosphorylase (MTAP) are

more frequent than p16INK4A (CDKN2) homozygous dele-

tions in primary non-small cell lung cancers (NSCLC). Onco-

gene 17, 2669–2675.

21. Stevens, A.P., Spangler, B., Wallner, S., Kreutz, M., Dettmer, K.,

Oefner, P.J., and Bosserhoff, A.K. (2009). Direct and tumor

microenvironment mediated influences of 5′-deoxy-5′-(meth-

ylthio)adenosine on tumor progression of malignant mela-

noma. J. Cell. Biochem. 106, 210–219.

22. Karikari, C.A., Mullendore, M., Eshleman, J.R., Argani, P.,

Leoni, L.M., Chattopadhyay, S., Hidalgo, M., and Maitra, A.

(2005). Homozygous deletions of methylthioadenosine phos-

phorylase in human biliary tract cancers. Mol. Cancer Ther. 4,

1860–1866.

23. Christopher, S.A., Diegelman, P., Porter, C.W., and Kruger, W.D.

(2002). Methylthioadenosine phosphorylase, a gene frequently

codeleted with p16(cdkN2a/ARF), acts as a tumor suppressor in

a breast cancer cell line. Cancer Res. 62, 6639–6644.

24. Subhi, A.L., Tang, B., Balsara, B.R., Altomare, D.A., Testa, J.R.,

Cooper, H.S., Hoffman, J.P., Meropol, N.J., and Kruger, W.D.

(2004). Loss of methylthioadenosine phosphorylase and

elevated ornithine decarboxylase is common in pancreatic

cancer. Clin. Cancer Res. 10, 7290–7296.

25. Huang, H.Y., Li, S.H., Yu, S.C., Chou, F.F., Tzeng, C.C., Hu,

T.H., Uen, Y.H., Tian, Y.F., Wang, Y.H., Fang, F.M., et al.

(2009). Homozygous deletion of MTAP gene as a poor prog-

nosticator in gastrointestinal stromal tumors. Clin. Cancer

Res. 15, 6963–6972.

26. Mukhopadhyay, N., Almasy, L., Schroeder, M., Mulvihill, W.P.,

and Weeks, D.E. (2005). Mega2: Data-handling for facilitating

genetic linkage and association analyses. Bioinformatics 21,

2556–2557.

27. Sobel, E., and Lange, K. (1996). Descent graphs in pedigree

analysis: Applications to haplotyping, location scores, and

marker-sharing statistics. Am. J. Hum. Genet. 58, 1323–1337.

28. Benson, G. (1999). Tandem repeats finder: A program to

analyze DNA sequences. Nucleic Acids Res. 27, 573–580.

29. Dowling, O., Difeo, A., Ramirez, M.C., Tukel, T., Narla, G.,

Bonafe, L., Kayserili, H., Yuksel-Apak, M., Paller, A.S., Norton,

K., et al. (2003). Mutations in capillary morphogenesis gene-2

result in the allelic disorders juvenile hyaline fibromatosis

and infantile systemic hyalinosis. Am. J. Hum. Genet. 73,

957–966.

30. Hebsgaard, S.M., Korning, P.G., Tolstrup, N., Engelbrecht, J.,

Rouze, P., and Brunak, S. (1996). Splice site prediction in Ara-

bidopsis thaliana pre-mRNA by combining local and global

sequence information. Nucleic Acids Res. 24, 3439–3452.

31. Brunak, S., Engelbrecht, J., and Knudsen, S. (1991). Prediction

of human mRNA donor and acceptor sites from the DNA

sequence. J. Mol. Biol. 220, 49–65.

32. Cartegni, L., Wang, J., Zhu, Z., Zhang, M.Q., and Krainer, A.R.

(2003). ESEfinder: A web resource to identify exonic splicing

enhancers. Nucleic Acids Res. 31, 3568–3571.

33. Savarese, T.M., Crabtree, G.W., and Parks, R.E., Jr. (1981).

5′-Methylthioadenosine phosphorylase-L. Substrate activity

of 5′-deoxyadenosine with the enzyme from Sarcoma 180

cells. Biochem. Pharmacol. 30, 189–199.

34. Watts, G.D., Mehta, S.G., Zhao, C., Ramdeen, S., Hamilton,

S.J., Novack, D.V., Mumm, S., Whyte, M.P., Mc Gillivray, B.,

and Kimonis, V.E. (2005). Mapping autosomal dominant

progressive limb-girdle myopathy with bone fragility to chro-

mosome 9p21-p22: A novel locus for a musculoskeletal

syndrome. Hum. Genet. 118, 508–514.

35. Henry, E.W., Auckland, N.L., McIntosh, H.W., and Starr, D.E.

(1958). Abnormality of the long bones and progressive

muscular dystrophy in a family. Can. Med. Assoc. J. 78,

331–336.

36. Picci, P., Bacci, G., Ferrari, S., and Mercuri, M. (1997). Neoad-

juvant chemotherapy in malignant fibrous histiocytoma of

bone and in osteosarcoma located in the extremities: Analo-

gies and differences between the two tumors. Ann. Oncol. 8,

1107–1115.

37. Jeon, D.G., Song, W.S., Kong, C.B., Kim, J.R., and Lee, S.Y.

(2010). MFH of Bone and Osteosarcoma Show Similar

Survival and Chemosensitivity. Clin. Orthop. Relat. Res. 469,

584–590.

38. Ghandur-Mnaymneh, L., Zych, G., and Mnaymneh, W.

(1982). Primary malignant fibrous histiocytoma of bone:

Report of six cases with ultrastructural study and analysis of

the literature. Cancer 49, 698–707.


39. Knudson, A.G. (2001). Two genetic hits (more or less) to

cancer. Nat. Rev. Cancer 1, 157–162.

40. Della Ragione, F., Cartenì-Farina, M., Gragnaniello, V., Schet-

tino, M.I., and Zappia, V. (1986). Purification and characteriza-

tion of 5′-deoxy-5′-methylthioadenosine phosphorylase from

human placenta. J. Biol. Chem. 261, 12324–12329.

41. Della Ragione, F., Oliva, A., Gragnaniello, V., Russo, G.L.,

Palumbo, R., and Zappia, V. (1990). Physicochemical and

immunological studies on mammalian 5′-deoxy-5′-methyl-

thioadenosine phosphorylase. J. Biol. Chem. 265, 6241–6246.

42. Appleby, T.C., Erion, M.D., and Ealick, S.E. (1999). The struc-

ture of human 5′-deoxy-5′-methylthioadenosine phosphory-

lase at 1.7 Å resolution provides insights into substrate

binding and catalysis. Structure 7, 629–641.

43. Williams-Ashman, H.G., Seidenfeld, J., and Galletti, P. (1982).

Trends in the biochemical pharmacology of 5′-deoxy-5′-meth-

ylthioadenosine. Biochem. Pharmacol. 31, 277–288.

44. Schramm, V.L. (2007). Enzymatic transition state theory and

transition state analogue design. J. Biol. Chem. 282, 28297–

28300.

45. Bannert, N., and Kurth, R. (2006). The evolutionary dynamics

of human endogenous retroviral families. Annu. Rev. Geno-

mics Hum. Genet. 7, 149–173.

46. Smit, A.F. (1993). Identification of a new, abundant super-

family of mammalian LTR-transposons. Nucleic Acids Res.

21, 1863–1872.

47. Fearon, E.R. (1997). Human cancer syndromes: Clues to the

origin and nature of cancer. Science 278, 1043–1050.

48. McPherson, R., Pertsemlidis, A., Kavaslar, N., Stewart, A., Rob-

erts, R., Cox, D.R., Hinds, D.A., Pennacchio, L.A., Tybjaerg-

Hansen, A., Folsom, A.R., et al. (2007). A common allele on

chromosome 9 associated with coronary heart disease. Science

316, 1488–1491.

49. Helgadottir, A., Thorleifsson, G., Manolescu, A., Gretarsdottir,

S., Blondal, T., Jonasdottir, A., Jonasdottir, A., Sigurdsson, A.,

Baker, A., Palsson, A., et al. (2007). A common variant on chro-

mosome 9p21 affects the risk of myocardial infarction.

Science 316, 1491–1493.

50. Willer, C.J., Sanna, S., Jackson, A.U., Scuteri, A., Bonnycastle,

L.L., Clarke, R., Heath, S.C., Timpson, N.J., Najjar, S.S., String-

ham, H.M., et al. (2008). Newly identified loci that influence

lipid concentrations and risk of coronary artery disease. Nat.

Genet. 40, 161–169.

51. van der Kooi, A.J., Ledderhof, T.M., de Voogt, W.G., Res, C.J.,

Bouwsma, G., Troost, D., Busch, H.F., Becker, A.E., and de

Visser, M. (1996). A newly recognized autosomal dominant

limb girdle muscular dystrophy with cardiac involvement.

Ann. Neurol. 39, 636–642.

52. Kimonis, V.E., Kovach, M.J., Waggoner, B., Leal, S., Salam, A.,

Rimer, L., Davis, K., Khardori, R., and Gelber, D. (2000). Clin-

ical and molecular studies in a unique family with autosomal

dominant limb-girdle muscular dystrophy and Paget disease

of bone. Genet. Med. 2, 232–241.

53. Takigawa, M., Nishida, Y., Suzuki, F., Kishi, J., Yamashita, K.,

and Hayakawa, T. (1990). Induction of angiogenesis in chick

yolk-sac membrane by polyamines and its inhibition by tissue

inhibitors of metalloproteinases (TIMP and TIMP-2). Bio-

chem. Biophys. Res. Commun. 171, 1264–1271.

54. Harris, S.P., Patel, J.R., Marton, L.J., and Moss, R.L. (2000).

Polyamines decrease Ca(2+) sensitivity of tension and

increase rates of activation in skinned cardiac myocytes. Am.

J. Physiol. Heart Circ. Physiol. 279, H1383–H1391.

55. Tantini, B., Fiumana, E., Cetrullo, S., Pignatti, C., Bonavita, F.,

Shantz, L.M., Giordano, E., Muscari, C., Flamigni, F., Guar-

nieri, C., et al. (2006). Involvement of polyamines in

apoptosis of cardiac myoblasts in a model of simulated

ischemia. J. Mol. Cell. Cardiol. 40, 775–782.

56. Harismendy, O., Notani, D., Song, X., Rahim, N.G., Tanasa, B.,

Heintzman, N., Ren, B., Fu, X.D., Topol, E.J., Rosenfeld, M.G.,

and Frazer, K.A. (2011). 9p21 DNA variants associated with

coronary artery disease impair interferon-γ signalling

response. Nature 470, 264–268.

57. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody,

M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh,

W., et al; International Human Genome Sequencing Consor-

tium. (2001). Initial sequencing and analysis of the human

genome. Nature 409, 860–921.

58. Mayer, J., and Meese, E. (2005). Human endogenous retrovi-

ruses in the primate lineage and their influence on host

genomes. Cytogenet. Genome Res. 110, 448–456.

59. Volff, J.N., and Brosius, J. (2007). Modern genomes with retro-

look: Retrotransposed elements, retroposition and the origin

of new genes. In Gene and Protein Evolution. Genome

Dynamics, J.-N. Volff, ed. (Basel: Karger), pp. 175–190.

60. Blaise, S., de Parseval, N., Benit, L., and Heidmann, T. (2003).

Genomewide screening for fusogenic human endogenous

retrovirus envelopes identifies syncytin 2, a gene conserved

on primate evolution. Proc. Natl. Acad. Sci. USA 100,

13013–13018.

61. Blond, J.L., Lavillette, D., Cheynet, V., Bouton, O., Oriol, G.,

Chapel-Fernandes, S., Mandrand, B., Mallet, F., and Cosset,

F.L. (2000). An envelope glycoprotein of the human endoge-

nous retrovirus HERV-W is expressed in the human placenta

and fuses cells expressing the type D mammalian retrovirus

receptor. J. Virol. 74, 3321–3329.

62. Mi, S., Lee, X., Li, X., Veldman, G.M., Finnerty, H., Racie,

L., LaVallie, E., Tang, X.Y., Edouard, P., Howes, S., et al.

(2000). Syncytin is a captive retroviral envelope protein

involved in human placental morphogenesis. Nature 403,

785–789.

63. Brosius, J., and Gould, S.J. (1992). On ‘‘genomenclature’’: A

comprehensive (and respectful) taxonomy for pseudogenes

and other ‘‘junk DNA’’. Proc. Natl. Acad. Sci. USA 89, 10706–

10710.

64. Krull, M., Brosius, J., and Schmitz, J. (2005). Alu-SINE exoniza-

tion: En route to protein-coding function. Mol. Biol. Evol. 22,

1702–1711.

65. Baertsch, R., Diekhans, M., Kent, W.J., Haussler, D., and Bro-

sius, J. (2008). Retrocopy contributions to the evolution of

the human genome. BMC Genomics 9, 466.


ARTICLE

Large-Scale Population Analysis Challenges the Current Criteria for the Molecular Diagnosis of Facioscapulohumeral Muscular Dystrophy

Isabella Scionti,1 Francesca Greco,1 Giulia Ricci,2 Monica Govi,1 Patricia Arashiro,3 Liliana Vercelli,4

Angela Berardinelli,5 Corrado Angelini,6 Giovanni Antonini,7 Michelangelo Cao,6 Antonio Di Muzio,8

Maurizio Moggio,9 Lucia Morandi,10 Enzo Ricci,11 Carmelo Rodolico,12 Lucia Ruggiero,13

Lucio Santoro,13 Gabriele Siciliano,2 Giuliano Tomelleri,14 Carlo Pietro Trevisan,15 Giuliana Galluzzi,16

Woodring Wright,17 Mayana Zatz,18 and Rossella Tupler1,19,*

Facioscapulohumeral muscular dystrophy (FSHD) is a common hereditary myopathy causally linked to reduced numbers (≤8) of 3.3

kilobase D4Z4 tandem repeats at 4q35. However, because individuals carrying D4Z4-reduced alleles and no FSHD and patients with

FSHD and no short allele have been observed, additional markers have been proposed to support an FSHD molecular diagnosis. In partic-

ular, a reduction in the number of D4Z4 elements combined with the 4A(159/161/168)PAS haplotype (which provides the possibility of

expressing DUX4) is currently used as the genetic signature uniquely associated with FSHD. Here, we analyzed these DNA elements in

more than 800 Italian and Brazilian samples of normal individuals unrelated to any FSHD patients. We find that 3% of healthy subjects

carry alleles with a reduced number (4–8) of D4Z4 repeats on chromosome 4q and that one-third of these alleles, 1.3%, occur in combi-

nation with the 4A161PAS haplotype. We also systematically characterized the 4q35 haplotype in 253 unrelated FSHD patients. We find

that only 127 of them (50.1%) carry alleles with 1–8 D4Z4 repeats associated with 4A161PAS, whereas the remaining FSHD probands

carry different haplotypes or alleles with a greater number of D4Z4 repeats. The present study shows that the current genetic signature

of FSHD is a common polymorphism and that only half of FSHD probands carry this molecular signature. Our results suggest that

the genetic basis of FSHD, which is remarkably heterogeneous, should be revisited, because this has important implications for genetic

counseling and prenatal diagnosis of at-risk families.

Introduction

Facioscapulohumeral muscular dystrophy (FSHD [MIM

158900]), a common myopathy, has a prevalence of 1 in

20,000.1,2 The disease is characterized by weakness of selec-

tive muscle groups and wide variability of clinical expres-

sion.1,3,4 The onset of the disease is in the second or third

decade of life and usually involves the weakening of facial

and limb-girdle muscles. The mode of inheritance of

classical FSHD is considered to be autosomal dominant,

with complete penetrance by age 20.4,5 No biochemical,

histological, or instrumental markers are available to inde-

pendently confirm a specific FSHD diagnosis, which remains

mainly clinical.

The FSHD genetic defect does not reside in any protein-

coding gene.6 Instead, FSHD has been genetically linked

to the reduction of an integral number of tandem 3.3-kb

D4Z4 repeats located on chromosome 4q35.7,8 Although

nearly identical D4Z4 sequences reside on chromosome

10q26,9 only subjects with a reduced number of D4Z4

repeats on chromosome 4, but not chromosome 10,

develop FSHD.10–12 Based on these results, p13E-11 EcoRI

alleles larger than 50 kb (≥11 D4Z4 repeats) originating

from chromosome 4 have been considered normal,

whereas alleles of 35 kb or less (≤8 D4Z4 repeats) have

been considered diagnostic for the disease.8,13
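The size thresholds described above can be sketched as a small classifier. This is only an illustration of the stated cutoffs (35 kb or less, and larger than 50 kb) for p13E-11 EcoRI alleles of chromosome 4 origin; the function name and the "intermediate" label for sizes between the two cutoffs are assumptions introduced here, not terms used by the authors.

```python
# Cutoffs taken from the text; everything else is a hypothetical illustration.
REDUCED_MAX_KB = 35.0  # alleles of 35 kb or less (<=8 D4Z4 repeats): considered diagnostic
NORMAL_MIN_KB = 50.0   # alleles larger than 50 kb (>=11 D4Z4 repeats): considered normal


def classify_ecori_allele(fragment_kb: float) -> str:
    """Classify a chromosome 4 p13E-11 EcoRI fragment by its size in kb."""
    if fragment_kb <= 0:
        raise ValueError("fragment size must be positive")
    if fragment_kb <= REDUCED_MAX_KB:
        return "reduced"
    if fragment_kb > NORMAL_MIN_KB:
        return "normal"
    # Sizes between the two cutoffs are not assigned a category in the text;
    # "intermediate" is a placeholder label used only in this sketch.
    return "intermediate"


if __name__ == "__main__":
    for size in (21.0, 35.0, 40.0, 55.0):
        print(size, classify_ecori_allele(size))
```

The sketch deliberately works on fragment size only; it does not attempt to convert kilobases to repeat counts, because the text does not give the length of the flanking sequence needed for that conversion.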

Because there are individuals with reduced D4Z4 alleles

that do not have clinical signs of FSHD,14,15 it has been pro-

posed that additional DNA sequences flanking the D4Z4

repeat array are necessary for disease development.16–18

These studies concluded that D4Z4 reduction is pathogenic

only in a few genetic backgrounds, which include a specific

simple sequence length polymorphism (SSLP) proximal

to the D4Z4 repeat and the 4qA polymorphism distal to

1Department of Biomedical Sciences, University of Modena and Reggio Emilia, Modena 41125, Italy; 2Department of Neuroscience, Neurological Clinic,

University of Pisa, Pisa 56126, Italy; 3Program in Genomics, Division of Genetics, Informatics Program, Children’s Hospital, The Howard Hughes Medical

Institute, Harvard Medical School, Boston, MA 02115, USA; 4Department of Neuroscience, Center for Neuromuscular Diseases, University of Turin, Turin

10126, Italy; 5Unit of Child Neurology and Psychiatry, IRCCS ‘‘C. Modino’’ Foundation, University of Pavia, Pavia 27100, Italy; 6Department of Neurosci-

ences, University of Padua, Padua 35129, Italy; 7Department of Neuroscience, Salute Mentale e Organi di Senso, S. Andrea Hospital, University of Rome

‘‘Sapienza,’’ Rome 00189, Italy; 8Center for Neuromuscular Disease, University ‘‘G. d’Annunzio,’’ Chieti 66013, Italy; 9Neuromuscular Unit, IRCCS

Foundation Ca Granda Ospedale Maggiore Policlinico, Dino Ferrari Center, University of Milan, Milan 20122, Italy; 10Unit of Muscular Pathology and

Immunology, Neurological Institute Foundation ‘‘Carlo Besta,’’ Milano 20133, Italy; 11Department of Neurosciences, Universita Cattolica Policlinico

A. Gemelli, Rome 00168, Italy; 12Department of Neurosciences, Psychiatry and Anaesthesiology, University of Messina, Messina 98125, Italy; 13Department

of Neurological Sciences, University ‘‘Federico II,’’ Naples 80131, Italy; 14Department of Neurological Sciences and Vision, University of Verona, Verona

37134, Italy; 15Department of Neurological and Psychiatric Sciences, University of Padua, Padua 35100, Italy; 16Molecular Genetics Laboratory of UILDM,

Lazio Section, IRCCS Santa Lucia Foundation, Rome 00179, Italy; 17Department of Cell Biology, University of Texas Southwestern Medical Center, Dallas,

TX 75390, USA; 18Human Genome Research Center, Department of Genetics and Evolutionary Biology, Institute of Biosciences, University of Sao Paulo,

Sao Paulo 05508-090, Brazil; 19Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01605, USA

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.019. ©2012 by The American Society of Human Genetics. All rights reserved.

628 The American Journal of Human Genetics 90, 628–635, April 6, 2012

the repeat (Figure 1). These haplotypes, named 4A159,

4A161, and 4A168, have been proposed to be uniquely

associated with FSHD. Recently it has been shown that

a single-nucleotide polymorphism (SNP) in the pLAM

sequence of the 4qA alleles provides a polyadenylation

signal (PAS; ATTAAA) for the DUX4 transcript from the

most distal D4Z4 unit on 4qA chromosomes. Thus, the

molecular signature, named 4A(159,161,168)PAS, has

been proposed to define alleles causally related to FSHD.

This signature results from the combination of (1) a reduc-

tion in the number of D4Z4 elements, (2) the presence of

the 4qA allele, and (3) the PAS in the pLAM sequence. In

this scenario, FSHD arises from a specific genetic setting

enabling the normally silent double homeobox protein 4

gene (DUX4 [MIM 606009]) to be expressed.18 On this basis,

healthy subjects carrying reduced D4Z4 alleles would be

explained by the absence of the 4A(159,161,168)PAS.
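The three-part signature described above can be written as a simple predicate over an allele record. The sketch below is a hedged illustration: the class `Allele4q`, its field names, and the function `has_proposed_signature` are hypothetical names chosen for this example. Only the criteria themselves (a reduced D4Z4 count, here taken as 8 or fewer repeats; the 4qA variant; an SSLP of 159, 161, or 168; and the ATTAAA polyadenylation signal in pLAM) come from the text.

```python
from dataclasses import dataclass
from typing import Optional

PERMISSIVE_SSLP = {159, 161, 168}  # SSLP sizes named in the 4A(159,161,168)PAS signature
PAS_HEXAMER = "ATTAAA"             # polyadenylation signal in the pLAM sequence


@dataclass(frozen=True)
class Allele4q:
    """Hypothetical record for one chromosome 4q allele."""
    d4z4_repeats: int            # number of D4Z4 units
    distal_variant: str          # "A", "B", or "null"
    sslp_bp: int                 # SSLP length in bp
    plam_hexamer: Optional[str]  # e.g. "ATTAAA" or "ATCAAA"; None if pLAM is absent


def has_proposed_signature(allele: Allele4q, max_repeats: int = 8) -> bool:
    """Return True if the allele matches the proposed 4A(159,161,168)PAS signature."""
    return (
        allele.d4z4_repeats <= max_repeats
        and allele.distal_variant == "A"
        and allele.sslp_bp in PERMISSIVE_SSLP
        and allele.plam_hexamer == PAS_HEXAMER
    )


if __name__ == "__main__":
    print(has_proposed_signature(Allele4q(6, "A", 161, "ATTAAA")))  # matches all criteria
    print(has_proposed_signature(Allele4q(6, "A", 166, "ATTAAA")))  # SSLP not in the set
    print(has_proposed_signature(Allele4q(6, "B", 161, None)))      # 4qB lacks pLAM and PAS
```

A predicate of this kind only encodes the proposed model; whether the signature is sufficient or necessary for disease is the question the study itself addresses.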

This model does not apply to all FSHD cases. For

example, nonpenetrant carriers have been reported in

FSHD families,14,15 and there are FSHD patients carrying

full-length D4Z4 alleles (≥11 repeats) that are clinically

indistinguishable from patients carrying D4Z4 alleles of

reduced size (≤8 repeats).19 Rare exceptions could be ex-

plained by a variety of mechanisms that do not challenge

the basic hypothesis. However, recently we found that

2.7% of cases in the Italian National Registry for FSHD

(which contains over 1,100 unrelated FSHD patients)

were compound heterozygotes carrying two D4Z4-reduced

alleles (0.5% were homozygotes for the 4A161 haplotype).

Based on this finding, we estimated that the population

frequency of the 4A161PAS haplotype associated with a

D4Z4-reduced allele could be higher than 1%.20

The correlation between genotype and phenotype in

FSHD thus appears to be more complex than just the

Figure 1. Schematic Representation of Polymorphisms at the 4q and 10q Subtelomeres

(A) Schematic representation of the method used to calculate D4Z4 repeat numbers from EcoRI fragment sizes. The D4Z4 repeat array is indicated with triangles. Seven and eight D4Z4 repeats (31–36 kb EcoRI fragment size) were defined to be the upper diagnostic range for FSHD. D4Z4 repeat units on chromosomes 4 and 10 can be distinguished because all repeats on 10q contain BlnI restriction sites (B within the black triangles), whereas all D4Z4 repeats on 4q contain XapI restriction sites (X within the white triangles).

(B) Schematic representation of the current view of pathogenic haplotypes.

(C) Elements examined in the present study. In addition to the number of D4Z4 repeats, elements that distinguish subjects include: (1) the chromosomal localization of the D4Z4 repeat, chromosome 4q35 or 10q26; (2) the SSLP, which is a combination of five variable number tandem repeats, an 8 bp insertion/deletion, and two SNPs localized 3.5 kb proximal to D4Z4; it varies in length between 157 and 182 bp; (3) the AT(T/C)AAA SNP in the pLAM region; (4) a large sequence variation (termed 4qA or B) that is distal to D4Z4. In the 4qB variant, the terminal 3.3 kb repeat contains only 570 bp of a complete repeat, whereas in the 4qA variant the terminal repeat is a divergent 3.3 kb repeat named pLAM. 4q chromosomes that do not hybridize to probes for (A) and (B) are termed ‘‘null,’’ and their sequences vary from case to case.


presence of the 4APAS signature. In order to confirm the

high frequency of this signature in the normal population

and reevaluate the allele distribution in FSHD patients, we

performed a systematic unbiased clinical and molecular

study of 801 normal control subjects from Italy and Brazil

and 253 FSHD probands from the Italian Registry for

FSHD. Our results establish that the 4APAS structure is

a frequent genetic polymorphism that is neither sufficient

nor necessary for the development of FSHD. This result is

not incompatible with evidence implicating DUX4 or

other factors as important mediators of the disease.

However, it does demonstrate that the pathogenesis is

more complex than currently thought and that the current

genetic signature is insufficient for diagnosis.

Subjects and Methods

Control Population

The control group consisted of 801 unrelated healthy subjects

with no family history of muscular dystrophy. Subjects were

recruited from the Italian and Brazilian populations through

advertisements. Italian control subjects were equally distributed

among Northern, Central, and Southern regions. The local ethics

committee approved the study. All subjects enrolled in the

study were clinically and molecularly characterized after giving

informed consent to participate (see Table S1 available online).

FSHD Patients

Two hundred fifty-three unrelated FSHD patients were accrued

through the Italian National Registry for FSHD. All subjects were

clinically and molecularly characterized. In particular, we consid-

ered patients to have typical FSHD if: (1) disease onset occurred

in facial or shoulder girdle muscles; (2) there was facial and/or

scapular fixator weakness; and (3) there was absence of atypical

signs suggesting an alternative diagnosis (including extraocular,

masticatory, pharyngeal, or lingual muscle weakness and cardio-

myopathy).21,22 Clinical data were collected with the FSHD

clinical form. The clinical severity of the disease was measured

according to the FSHD score, as previously described.23 Briefly,

the FSHD score quantifies the degree of weakness and defines

the level of disability affecting six separate muscle groups: facial

(score 0–2), shoulder girdle (score 0–3), upper limbs (score 0–2),

pelvic girdle (score 0–5), leg muscles (score 0–2), and Beevor’s

sign (score 0–1).23 The final clinical evaluation score, calculated

by summing the single scores, ranged from 0, when no signs of

muscle weakness are present, to 15, when all muscle groups tested

are severely impaired.23 All selected subjects were evaluated using

a standard protocol, and each subject received the standardized

FSHD score previously described.23

Molecular Genetic Analysis

DNA was prepared from isolated lymphocytes according to stan-

dard procedures. In brief, restriction endonuclease digestion of

DNA was performed in agarose plugs with the appropriate restric-

tion enzyme: EcoRI, EcoRI/BlnI, XapI (p13E-11 probe), HindIII

(4qA/4qB probes), and NotI (B31 probe). Digested DNA was

separated by pulsed-field gel electrophoresis (PFGE) in 1% agarose

gels. Allele sizes were estimated by Southern hybridization with

probe p13E-11 of 7 μg of EcoRI-, EcoRI/BlnI-, and XapI-digested

genomic DNA extracted from peripheral blood lymphocytes,

electrophoresed in a 0.4% agarose gel for 45–48 hr at 35 V, along-

side an 8–48 kb marker (Bio-Rad). To assess the chromosomal

origin of the two D4Z4-reduced alleles, DNA from each proband

was analyzed by NotI digestion and hybridization with the

B31 probe (Figure S1). Restriction fragments were detected

by autoradiography or by using a Typhoon Trio system (GE

Healthcare).

4qA/4qB allelic variants were defined using 7 μg of HindIII-

digested DNA, PFGE, and Southern blot hybridiza-

tion with radiolabeled 4qB and 4qA probes according to standard

procedures. The 4qA/4qB variants were attributed to each chromo-

some based on the size of EcoRI restriction fragments (Figure S1).

To define the SSLP and the pLAM SNP (AT(T/C)AAA) sequences

flanking the D4Z4 repeat units, linear gel electrophoresis of

EcoRI-digested DNA was used to isolate each D4Z4-reduced allele.

The SSLP sequence was determined after PCR amplification using

specific oligonucleotides (forward primer 5′-GGTGGAGTTCTGGTTTCAGC-3′

labeled with hexachlorofluorescein [HEX], reverse

primer 5′-CCTGTGCTTCAGAGGCATTTG-3′) as previously re-

ported.17,18 Analysis of the pLAM SNP was performed on PCR-

amplified DNA using specific oligonucleotides (forward primer

5′-ACGCTGTCTAGGCAAACCTG-3′, reverse primer

5′-TGCACTCATCACACAAAAGATG-3′). SSLP size differences and pLAM

sequences were analyzed using an ABI Prism 3130 Genetic

Analyzer.17,18
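As a quick sanity check on the primer sequences quoted above, the following sketch reports each primer's length and GC content and provides a reverse-complement helper. The sequences are copied from the text (rejoined where they wrapped across lines); the dictionary keys and function names are hypothetical and are not part of the authors' protocol.

```python
# Primer sequences as quoted in the Methods, written 5' to 3'.
PRIMERS = {
    "SSLP_forward": "GGTGGAGTTCTGGTTTCAGC",
    "SSLP_reverse": "CCTGTGCTTCAGAGGCATTTG",
    "pLAM_forward": "ACGCTGTCTAGGCAAACCTG",
    "pLAM_reverse": "TGCACTCATCACACAAAAGATG",
}

_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_count(seq: str) -> int:
    """Number of G or C bases in the sequence."""
    return sum(1 for base in seq.upper() if base in "GC")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        gc = gc_count(seq)
        print(f"{name}: {len(seq)} nt, GC {gc}/{len(seq)} = {100 * gc / len(seq):.0f}%")
```

The reverse-complement helper is useful when locating a reverse primer's binding site on the forward strand of a reference sequence.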

Results

Presence of FSHD-Sized Alleles in the Healthy

Population

Previous smaller investigations of the Dutch population

suggested that 3% of healthy subjects had D4Z4 alleles of

35–38 kb in size, and one-third of these might present

a potentially pathogenic 4A allele.11,24 An additional

recent genetic criterion is that, in order to have FSHD, the

reduction of D4Z4 repeats must be associated with

a specific chromosomal background, 4A(159/161/168)

PAS, allowing the expression of the DUX4 gene.18

However, the frequency of compound heterozygotes in

patients with FSHD suggested that the frequency of

D4Z4-reduced 24–35 kb alleles associated with the

4A161PAS in the Italian population would be >1%.20

Because this prediction has crucial implications for clinical

practice, we searched for D4Z4-reduced alleles associated

with the 4A161PAS haplotype in 801 healthy individuals,

560 from Italy and 241 from Brazil. Figure 2 shows that

25 of these 801 subjects carry D4Z4 alleles ranging from

21 to 35 kb (4 to 8 D4Z4 units); 17 of these 25 alleles are

associated with 4qA (Figure 2, groups 1 and 2), and 11 carry

the 4A161PAS haplotype (Figure 2, group 1; Figures S2–S7).

Therefore, 3% (25 of 801) of normal controls carry D4Z4

alleles of reduced size, and ~1.3% (11) have the supposedly

pathogenic 4A161PAS haplotype. The age of all these

healthy carriers ranged between 40 and 78 years, an age

in which FSHD is considered to be fully penetrant. On

this basis, we conclude that the haplotype 4A161PAS has

the frequency of a common polymorphism and that it


may be permissive but is not sufficient to cause autosomal-

dominant disease.
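The carrier frequencies quoted above follow directly from the counts among the 801 controls. The sketch below recomputes them and attaches a 95% Wilson score interval to each. The interval is an addition made here for illustration (the authors report point estimates only), and the function and variable names are hypothetical. Point estimates may differ in the last digit from those quoted in the text, depending on rounding.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion k/n (z = 1.96 for ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


N_CONTROLS = 801  # healthy subjects from Italy and Brazil
COUNTS = {
    "reduced D4Z4 allele (4-8 units)": 25,
    "reduced allele on 4qA": 17,
    "reduced allele with 4A161PAS": 11,
}

if __name__ == "__main__":
    for label, k in COUNTS.items():
        low, high = wilson_interval(k, N_CONTROLS)
        print(f"{label}: {k}/{N_CONTROLS} = {100 * k / N_CONTROLS:.1f}% "
              f"(95% CI {100 * low:.1f}-{100 * high:.1f}%)")
```

Under these assumptions, the lower bound of the interval for the 4A161PAS carriers gives a rough sense of whether the frequency is compatible with the 1% level conventionally used to call a variant a common polymorphism.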

Multiple Haplotypes Associated with FSHD

Our observation that in the general population 1.3% of

healthy subjects carry the FSHD ‘‘pathogenic’’ signature

4A161PAS, which enables the expression of DUX4, chal-

lenges the notion that FSHD is a fully-penetrant auto-

somal-dominant disorder caused by the reduction of

D4Z4 repeat number associated with 4A161PAS haplotype.

To test the DUX4 polyadenylation model more broadly, we

systematically studied the 4q35 haplotype of 253 probands

accrued through the Italian National Registry for FSHD

(Table S2). The D4Z4 repeat size was systematically studied

in all subjects. Table 1 shows that 204 of 253 probands

(80.6%) carry D4Z4 alleles with 1–8 units, 19 (7.5%) have

D4Z4 alleles with 9–10 repeats, and the remaining 30

(11.8%) show large D4Z4 alleles (≥11 repeats) on both

copies of chromosome 4 (Table 1). We then analyzed the

223 FSHD patients with 10 or fewer D4Z4 repeats for the

presence of the 4A/B, PAS, and SSLP (see Figure 1). Only

127 FSHD probands carry the 4A161PAS haplotype associ-

ated with alleles having 1–8 D4Z4 repeats (group 1 in

Figure 3). Among the remaining probands, 52 have

reduced alleles associated with the 4A166PAS haplotype

previously considered to not be ‘‘permissive’’ for FSHD

disease (group 2 in Figure 3), 13 carry the 4A162PAS, 5

carry the 4A164PAS, 2 carry the 4A167PAS, 1 carries the

4A163PAS (groups 7–10 in Figure 3), and 3 bear reduced

D4Z4 alleles with the 4qB polymorphism, which lacks

Figure 2. Molecular Haplotypes of 25 Healthy Subjects from Italian and Brazilian Control Populations

Haplotypes of 25 4q alleles with 4–8 D4Z4 repeats identified in 801 normal healthy individuals randomly selected from the general populations of Italy and Brazil. We characterized the sequence variants in the SSLP (colored rectangles), D4Z4 repeat number (triangles), distal variants A or B (shaded boxes), and the pLAM PAS (unshaded rectangles). The second and third columns present the number of reduced D4Z4 alleles detected (N°), the provenance (p) of the subjects (Ita for Italy and Bra for Brazil), and the prevalence of each haplotype among the 801 individuals examined (%). The supposedly pathogenic haplotype 4A161PAS is the most frequent and is present in 1.3% of healthy controls.

Table 1. Distribution of 253 Unrelated FSHD Patients versus D4Z4 Repeat Units

                     Number of Unrelated FSHD Patients
D4Z4 Repeat Units    De Novo Cases    Familial Cases    Percentage
1–3                  5                22                10.7%
4–8                  2                175               69.9%
9–10                 0                19                7.5%
>11                  0                30                11.8%

both the pLAM region and the PAS (groups

3–4 in Figure 3). Examples of each group

are reported in Figures S8–S12. Collectively,

our data reveal that in our cohort of FSHD

probands the SSLP allelic variants associated

with D4Z4-reduced alleles differ from those

previously reported (compare groups 2–10

in Figure 3 with Figure 1).17,18 This geno-

typic difference is also supported by the

fact that we did not find the 4A168 ‘‘permis-

sive’’ haplotype associated with FSHD in our population.

In contrast, haplotypes considered to not be ‘‘permissive’’

for FSHD disease were frequent. In particular, the

4A166PAS haplotype is associated with almost

one-quarter (23.3%) of D4Z4-reduced alleles detected in

our FSHD probands. More importantly, 49 of 253 FSHD

probands (19%) carry alleles with more than 8 D4Z4

repeats, and only 127 (50.1%) carry D4Z4-reduced alleles

associated with the 4A161PAS, the expected molecular

signature for FSHD.
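
The proportions quoted above can be rechecked from the counts in Table 1. The following is a minimal sketch of that arithmetic only; the dictionary layout and the helper name `share` are illustrative choices, not part of the study's analysis.

```python
# Counts transcribed from Table 1: (de novo cases, familial cases)
# for each class of D4Z4 repeat units.
table1 = {
    "1-3": (5, 22),
    "4-8": (2, 175),
    "9-10": (0, 19),
    ">11": (0, 30),
}

# Total number of unrelated FSHD probands across all classes.
total = sum(de_novo + familial for de_novo, familial in table1.values())


def share(count, denominator):
    """Return count/denominator expressed as a percentage."""
    return 100.0 * count / denominator


for repeat_class, (de_novo, familial) in table1.items():
    n = de_novo + familial
    print(f"{repeat_class:>5}: {n:3d} probands, {share(n, total):.1f}%")

# Probands with more than 8 repeats: classes 9-10 and >11 combined.
over_8 = sum(sum(table1[k]) for k in ("9-10", ">11"))
print(f"more than 8 repeats: {over_8} of {total} ({share(over_8, total):.1f}%)")
```

Percentages printed with one decimal are rounded, so the last digit may differ slightly from the values shown in Table 1.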

Discussion

The practice of medical genetics requires a clear, definite

evaluation of the significance of mutations and/or varia-

tions of DNA sequences for diagnosis to provide prognostic

information and genetic counseling. This is particularly

important for a progressive disease with unpredictable

onset and a high variability of clinical expression, such as

FSHD.

The extensive use over the past 20 years of DNA anal-

ysis for studying Mendelian disorders has revealed many

complex mechanisms in addition to single mutant genes

that cause disease. Identical phenotypes may be produced

by mutations in different genes,25 the same mutation can

cause different phenotypes,26 and distinct mutations in

the same gene may result in different disorders that

segregate with diverse Mendelian or even multifactorial

patterns.27 In addition, the incomplete penetrance of

certain mutations argues for the importance of modifying

loci or epigenetic mechanisms influencing the clinical

expression in many Mendelian disorders.28 Thus, estab-

lishing the value of mutational events underlying genetic

diseases may be complex even when there are simple

patterns of inheritance in diseases with a well-character-

ized pathologic course.

FSHD seems to fall in this complex pattern, even

though it is currently considered to be a fully penetrant

disease with a wide variability in clinical spectrum,

ranging from subjects with very mild muscle weakness

to wheelchair-bound patients.5,13 The molecular test

initially used for FSHD diagnosis was based on the obser-

vation that 95% of FSHD patients carry a reduction of

integral numbers of D4Z4 repeats at 4q35 with full pene-

trance.10 However, the wide use of this test revealed

several exceptions to the original model. Through the

years, the threshold size of D4Z4 alleles has been

increased from the original 28 kb7 (6 repeats) to 35 kb10

(8 repeats), with FSHD cases carrying D4Z4 alleles of 38–

41 kb (9–11 repeats), considered borderline alleles.22,29

Figure 3. Molecular Haplotypes of 223 Unrelated FSHD Patients

Overview of haplotypes of the D4Z4-reduced alleles (≤10 repeats) on chromosome 4 found in 223 unrelated FSHD patients. Alleles are separated into 1–3 (column 1), 4–8 (column 3), and 9–10 (column 5) D4Z4 repeat units. Within these columns, we group the observed haplotypes based on the type of SSLP and PAS. N° indicates the number of alleles found in each haplotype, and TOT(%) represents the total number of alleles found and the prevalence of each haplotype. The 4A haplotypes not previously observed in FSHD patients include 4A166 (group 2), 4A162 (group 7), 4A164 (group 8), 4A167 (group 9), and 4A163 (group 10), and the 4B haplotypes include 4B163 (group 3) and 4B166 (group 4). Frequencies of 4A161 (group 1, 63.7%) and 4A166 (group 2, 25.1%) are different than previously reported in other normal populations. Chromosomes with 4qB haplotypes (groups 3–4) lack the pLAM and PAS.

Additional genotype-phenotype studies led to the identifi-

cation of subjects carrying D4Z4-reduced alleles with no

sign of muscle weakness in FSHD families,14,15 as well as

of healthy unrelated subjects without family history of

FSHD.11,30 The present results from our systematic clinical

and molecular analysis of FSHD patients from the Italian

National Registry for FSHD, as well as a large number of

healthy controls, challenge the current model for FSHD

diagnosis.

Remarkably, our data establish as a general rule rather

than an exception that detection of a D4Z4-reduced

allele is not sufficient to diagnose FSHD. Although

the majority of FSHD patients (70%) carry D4Z4

alleles with 4–8 units, this size range is carried by 3% of

healthy subjects from the general population. Addition-

ally, there is little predictive value of the 4qA161PAS

haplotype in the absence of family history, because

1.3% of healthy subjects carry this haplotype, which

therefore has the frequency of a common polymorphism

(Figure 1) rather than a rare mutation. Finally, 49 of 253

probands (19%) do not carry D4Z4 alleles with 1–8

repeats, and only 50% of the probands carry the 4A161

permissive haplotype.
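
To illustrate why a marker carried by 3% of healthy subjects has little predictive value on its own, the carrier rates above can be combined with Bayes' rule. This is a rough sketch under stated assumptions: the two carrier rates (70% of patients, 3% of healthy subjects) are taken from the text, whereas the disease prevalence of 1 in 20,000 is an assumed, purely illustrative value that is not reported in this section, and the function name is hypothetical.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """Bayes' rule: probability of disease given that the marker is present."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


carrier_rate_patients = 0.70      # patients with 4-8 D4Z4 units (from the text)
carrier_rate_healthy = 0.03       # healthy subjects with 4-8 units (from the text)
assumed_prevalence = 1 / 20_000   # ASSUMPTION: illustrative value, not from this section

ppv = positive_predictive_value(
    carrier_rate_patients, carrier_rate_healthy, assumed_prevalence
)
print(f"Predictive value in an unselected individual: {ppv:.3%}")
```

Under these assumptions the predictive value in an unselected individual is far below 1%. The calculation does not apply to subjects with clinical signs or a family history, for whom the prior probability of disease is much higher.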

In summary, our study indicates that a profound

rethinking of the genetic disease mechanism and modes

of inheritance of FSHD is now required and that entirely

new models and approaches are needed. Our results do not

exclude an important pathogenic role for DUX4 or other

candidate factors but do establish a complex mechanism

beyond current understanding. Indeed, our data point

to the possibility that in the heterozygous state a D4Z4

reduction might produce a subclinical sensitized condition

that requires other epigenetic mechanisms or a contrib-

uting factor to cause overt myopathy. In some rare cases,

that could be by becoming homozygous20 and doubling

the dose of a dominant factor such as DUX4. In others, it

might be by the simultaneous heterozygosity for a different

and recessive myopathy, as suggested by many reports

in which the FSHD contractions are found in association

with a second molecular defect.31–44 This possibility is

also consistent with previous reports of expression

changes of candidate proteins such as CRYM that were

associated with FSHD in some families but that were

unchanged when other families were examined. Finally,

it is also plausible that drugs or toxic agents might

contribute to the disease onset and clinical variability.

This would explain the observation of discordant mono-

zygotic twins carrying the FSHD reduction.45,46 It is hoped

that broadening the scope of investigations, including

next-generation deep sequencing in particular in families

with asymptomatic and clinically affected members

carrying the same FSHD allele, may finally lead to an

understanding of the molecular pathogenesis of this

complex disease. These findings have important clinical

implications for genetic counseling of patients and fami-

lies with FSHD, with particular regard to the interpretation

of data in prenatal diagnosis.

Supplemental Data

Supplemental Data include twelve figures and two tables and can

be found with this article online at http://www.cell.com/AJHG/.

Acknowledgments

We are indebted to all FSHD patients and their families for partici-

pating in this study. The Associazione Amici del Centro Dino

Ferrari-University of Milan is gratefully acknowledged. We thank

Paul D. Kaufman and Michael R. Green for their in-depth critique

of the manuscript. DNA from Brazilian controls was kindly

provided by Naila Lourenco and Antonia Cerqueira (Department

of Genetics and Evolutionary Biology, Institute of Biosciences,

University of Sao Paulo, Brazil). This work was supported by Tele-

thon GUP08004 and GUP11009, by Association Francaise Contre

les Myopathies 14339, by National Institute of Health-National

Institutes of Neurological Disorders and Stroke grant RO1

NS047584, by Centros de Pesquisa, Inovacao e Difusao/Fundacao

de Amparo a Pesquisa do Estado de Sao Paulo, by Institutos

Nacionais de Ciencia e Tecnologia, and by Conselho Nacional de

Desenvolvimento Cientıfico e Tecnologico.

Received: November 12, 2011

Revised: January 27, 2012

Accepted: February 16, 2012

Published online: April 4, 2012

Web Resources

The URL for data presented herein is as follows:

Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.

nlm.nih.gov/Omim/

References

1. Padberg, G.W. (1982). Facioscapulohumeral disease. PhD

thesis, Leiden University, Leiden, Holland.

2. Mostacciuolo, M.L., Pastorello, E., Vazza, G., Miorin, M.,

Angelini, C., Tomelleri, G., Galluzzi, G., and Trevisan, C.P.

(2009). Facioscapulohumeral muscular dystrophy: epidemio-

logical and molecular study in a north-east Italian population

sample. Clin. Genet. 75, 550–555.

3. Flanigan, K.M. (2004). Facioscapulohumeral muscular dys-

trophy and scapuloperoneal disorders. In Myology, A. Engel

and C. Franzini-Armstrong, eds. (New York: McGraw-Hill

Professional), pp. 1123–1133.

4. Lunt, P.W., and Harper, P.S. (1991). Genetic counselling in

facioscapulohumeral muscular dystrophy. J. Med. Genet. 28,

655–664.

5. Tawil, R., van der Maarel, S., Padberg, G.W., and van Engelen,

B.G. (2010). 171st ENMC international workshop: Standards

of care and management of facioscapulohumeral muscular

dystrophy. Neuromuscul. Disord. 20, 471–475.

6. Hewitt, J.E., Lyle, R., Clark, L.N., Valleley, E.M., Wright, T.J.,

Wijmenga, C., van Deutekom, J.C., Francis, F., Sharpe, P.T.,

Hofker, M., et al. (1994). Analysis of the tandem repeat locus

D4Z4 associated with facioscapulohumeral muscular

dystrophy. Hum. Mol. Genet. 3, 1287–1295.

7. Wijmenga, C., Hewitt, J.E., Sandkuijl, L.A., Clark, L.N., Wright,

T.J., Dauwerse, H.G., Gruter, A.M., Hofker, M.H., Moerer, P.,

Williamson, R., et al. (1992). Chromosome 4q DNA rearrange-

ments associated with facioscapulohumeral muscular

dystrophy. Nat. Genet. 2, 26–30.

8. van Deutekom, J.C.T., Wijmenga, C., van Tienhoven, E.A.,

Gruter, A.M., Hewitt, J.E., Padberg, G.W., van Ommen, G.J.,

Hofker, M.H., and Frants, R.R. (1993). FSHD associated DNA

rearrangements are due to deletions of integral copies of

a 3.2 kb tandemly repeated unit. Hum. Mol. Genet. 2, 2037–

2042.

9. Deidda, G., Cacurri, S., Grisanti, P., Vigneti, E., Piazzo, N., and

Felicetti, L. (1995). Physical mapping evidence for a duplicated

region on chromosome 10qter showing high homology with

the facioscapulohumeral muscular dystrophy locus on chro-

mosome 4qter. Eur. J. Hum. Genet. 3, 155–167.

10. van Deutekom, J.C.T., Bakker, E., Lemmers, R.J., van der Wie-

len, M.J., Bik, E., Hofker, M.H., Padberg, G.W., and Frants, R.R.

(1996). Evidence for subtelomeric exchange of 3.3 kb

tandemly repeated units between chromosomes 4q35 and

10q26: implications for genetic counselling and etiology of

FSHD1. Hum. Mol. Genet. 5, 1997–2003.

11. van Overveld, P.G.M., Lemmers, R.J., Deidda, G., Sandkuijl, L.,

Padberg, G.W., Frants, R.R., and van der Maarel, S.M. (2000).

Interchromosomal repeat array interactions between chromo-

somes 4 and 10: a model for subtelomeric plasticity. Hum.

Mol. Genet. 9, 2879–2884.

12. Matsumura, T., Goto, K., Yamanaka, G., Lee, J.H., Zhang, C.,

Hayashi, Y.K., and Arahata, K. (2002). Chromosome 4q;10q

translocations; comparison with different ethnic populations

and FSHD patients. BMC Neurol. 2, 7.

13. Lunt, P.W., Jardine, P.E., Koch, M.C., Maynard, J., Osborn, M.,

Williams, M., Harper, P.S., and Upadhyaya, M. (1995). Correlation

between fragment size at D4F104S1 and age at onset or at

wheelchair use, with a possible generational effect, accounts

for much phenotypic variation in 4q35-facioscapulohumeral

muscular dystrophy (FSHD). Hum. Mol. Genet. 4, 951–958.

14. Ricci, E., Galluzzi, G., Deidda, G., Cacurri, S., Colantoni, L.,

Merico, B., Piazzo, N., Servidei, S., Vigneti, E., Pasceri, V.,

et al. (1999). Progress in the molecular diagnosis of

facioscapulohumeral muscular dystrophy and correlation between the

number of KpnI repeats at the 4q35 locus and clinical pheno-

type. Ann. Neurol. 45, 751–757.

15. Tonini, M.M., Passos-Bueno, M.R., Cerqueira, A., Matioli, S.R.,

Pavanello, R., and Zatz, M. (2004). Asymptomatic carriers and

gender differences in facioscapulohumeral muscular dys-

trophy (FSHD). Neuromuscul. Disord. 14, 33–38.

16. Lemmers, R.J., Wohlgemuth, M., Frants, R.R., Padberg, G.W.,

Morava, E., and van der Maarel, S.M. (2004). Contractions of

D4Z4 on 4qB subtelomeres do not cause facioscapulohumeral

muscular dystrophy. Am. J. Hum. Genet. 75, 1124–1130.

17. Lemmers, R.J., Wohlgemuth, M., van der Gaag, K.J., van der

Vliet, P.J., van Teijlingen, C.M., de Knijff, P., Padberg, G.W.,

Frants, R.R., and van der Maarel, S.M. (2007). Specific

sequence variations within the 4q35 region are associated

with facioscapulohumeral muscular dystrophy. Am. J. Hum.

Genet. 81, 884–894.

18. Lemmers, R.J., van der Vliet, P.J., Klooster, R., Sacconi, S., Ca-

mano, P., Dauwerse, J.G., Snider, L., Straasheijm, K.R., van

Ommen, G.J., Padberg, G.W., et al. (2010). A unifying genetic

model for facioscapulohumeral muscular dystrophy. Science

329, 1650–1653.

19. de Greef, J.C., Lemmers, R.J., Camano, P., Day, J.W., Sacconi,

S., Dunand, M., van Engelen, B.G., Kiuru-Enari, S., Padberg,

G.W., Rosa, A.L., et al. (2010). Clinical features of facioscapu-

lohumeral muscular dystrophy 2. Neurology 75, 1548–1554.

20. Scionti, I., Fabbri, G., Fiorillo, C., Ricci, G., Greco, F., D’Amico,

R., Termanini, A., Vercelli, L., Tomelleri, G., Cao, M., et al.

(2012). Facioscapulohumeral muscular dystrophy: new

insights from compound heterozygotes and implication for

prenatal genetic counselling. J. Med. Genet. 49, 171–178.

21. Padberg, G.W., Lunt, P.W., Koch, M., and Fardeau, M. (1991).

Diagnostic criteria for facioscapulohumeral muscular

dystrophy. Neuromuscul. Disord. 1, 231–234.

22. Butz, M., Koch, M.C., Muller-Felber, W., Lemmers, R.J., van

der Maarel, S.M., and Schreiber, H. (2003). Facioscapulohum-

eral muscular dystrophy. Phenotype-genotype correlation in

patients with borderline D4Z4 repeat numbers. J. Neurol.

250, 932–937.

23. Lamperti, C., Fabbri, G., Vercelli, L., D’Amico, R., Frusciante,

R., Bonifazi, E., Fiorillo, C., Borsato, C., Cao, M., Servida, M.,

et al. (2010). A standardized clinical evaluation of patients

affected by facioscapulohumeral muscular dystrophy: The

FSHD clinical score. Muscle Nerve 42, 213–217.

24. Wohlgemuth, M., Lemmers, R.J., van der Kooi, E.L., van der

Wielen, M.J., van Overveld, P.G., Dauwerse, H., Bakker, E.,

Frants, R.R., Padberg, G.W., and van der Maarel, S.M. (2003).

Possible phenotypic dosage effect in patients compound het-

erozygous for FSHD-sized 4q35 alleles. Neurology 61, 909–913.

25. Casasnovas, C., Cano, L.M., Albertı, A., Cespedes, M., and

Rigo, G. (2008). Charcot-Marie-Tooth disease. Foot Ankle

Spec 1 (6, Spec.), 350–354.

26. Takahashi, M., Asai, N., Iwashita, T., Murakami, H., and Ito, S.

(1998). Mechanisms of development of multiple endocrine

neoplasia type 2 and Hirschsprung’s disease by ret mutations.

Recent Results Cancer Res. 154, 229–236.

27. Kanagawa, M., and Toda, T. (2006). The genetic and molecular

basis of muscular dystrophy: roles of cell-matrix linkage in the

pathogenesis. J. Hum. Genet. 51, 915–926.

28. Chahwan, R., Wontakal, S.N., and Roa, S. (2011). The multidi-

mensional nature of epigenetic information and its role in

disease. Discov. Med. 11, 233–243.

29. Vitelli, F., Villanova, M., Malandrini, A., Bruttini, M., Piccini,

M., Merlini, L., Guazzi, G., and Renieri, A. (1999). Inheritance

of a 38-kb fragment in apparently sporadic facioscapulohum-

eral muscular dystrophy. Muscle Nerve 22, 1437–1441.

30. Weiffenbach, B., Bagley, R., Falls, K., Hyser, C., Storvick, D., Ja-

cobsen, S.J., Schultz, P., Mendell, J., Willems van Dijk, K., Mil-

ner, E.C., et al. (1992). Linkage analyses of five chromosome 4

markers localizes the facioscapulohumeral muscular

dystrophy (FSHD) gene to distal 4q35. Am. J. Hum. Genet.

51, 416–423.

31. Lecky, B.R., MacKenzie, J.M., Read, A.P., and Wilcox, D.E.

(1991). X-linked and FSH dystrophies in one family. Neuro-

muscul. Disord. 1, 275–278.

32. Felice, K.J., North, W.A., Moore, S.A., and Mathews, K.D.

(2000). FSH dystrophy 4q35 deletion in patients presenting

with facial-sparing scapular myopathy. Neurology 54, 1927–

1931.

33. van der Kooi, A.J., Visser, M.C., Rosenberg, N., van den Berg-

Vos, R., Wokke, J.H., Bakker, E., and de Visser, M. (2000).

Extension of the clinical range of facioscapulohumeral

dystrophy: report of six cases. J. Neurol. Neurosurg. Psychiatry

69, 114–116.

34. Krasnianski, M., Eger, K., Neudecker, S., Jakubiczka, S., and

Zierz, S. (2003). Atypical phenotypes in patients with

facioscapulohumeral muscular dystrophy 4q35 deletion.

Arch. Neurol. 60, 1421–1425.

35. Chuenkongkaew, W.L., Lertrit, P., Limwongse, C., Nilanont,

Y., Boonyapisit, K., Sangruchi, T., Chirapapaisan, N., and Su-

phavilai, R. (2005). An unusual family with Leber’s hereditary

optic neuropathy and facioscapulohumeral muscular

dystrophy. Eur. J. Neurol. 12, 388–391.

36. Filosto, M., Tonin, P., Scarpelli, M., Savio, C., Greco, F., Man-

cuso, M., Vattemi, G., Govoni, V., Rizzuto, N., Tupler, R., and

Tomelleri, G. (2008). Novel mitochondrial tRNA Leu(CUN)

transition and D4Z4 partial deletion in a patient with a

facioscapulohumeral phenotype. Neuromuscul. Disord. 18, 204–209.

37. Rudnik-Schoneborn, S., Weis, J., Kress, W., Hausler, M., and

Zerres, K. (2008). Becker’s muscular dystrophy aggravating

facioscapulohumeral muscular dystrophy—double trouble as

an explanation for an atypical phenotype. Neuromuscul.

Disord. 18, 881–885.

38. Korngut, L., Siu, V.M., Venance, S.L., Levin, S., Ray, P., Lem-

mers, R.J., Keith, J., and Campbell, C. (2008). Phenotype of

combined Duchenne and facioscapulohumeral muscular

dystrophy. Neuromuscul. Disord. 18, 579–582.

39. Zouvelou, V., Manta, P., Kalfakis, N., Evdokimidis, I., and Vas-

silopoulos, D. (2009). Asymptomatic elevation of serum crea-

tine kinase leading to the diagnosis of 4q35 facioscapulohum-

eral muscular dystrophy. J. Clin. Neurosci. 16, 1218–1219.

40. Tsuji, M., Kinoshita, M., Imai, Y., Kawamoto, M., and Kohara,

N. (2009). Facioscapulohumeral muscular dystrophy present-

ing with hypertrophic cardiomyopathy: a case study. Neuro-

muscul. Disord. 19, 140–142.

41. Reilich, P., Schramm, N., Schoser, B., Schneiderat, P., Strigl-Pill,

N., Muller-Hocker, J., Kress, W., Ferbert, A., Rudnik-Schone-

born, S., Noth, J., et al. (2010). Facioscapulohumeral muscular

dystrophy presenting with unusual phenotypes and atypical

morphological features of vacuolar myopathy. J. Neurol.

257, 1108–1118.

42. Jordan, B., Eger, K., Koesling, S., and Zierz, S. (2011). Campto-

cormia phenotype of FSHD: a clinical and MRI study on six

patients. J. Neurol. 258, 866–873.

43. Tonini, M.M., Passos-Bueno, M.R., Cerqueira, A., Pavanello,

R., Vainzof, M., Dubowitz, V., and Zatz, M. (2002). Facioscapu-

lohumeral (FSHD1) and other forms of muscular dystrophy in

the same family: is there more in muscular dystrophy than

meets the eye? Neuromuscul. Disord. 12, 554–557.

44. Ricci, G., Scionti, I., Alı, G., Volpi, L., Zampa, V., Fanin, M.,

Angelini, C., Politano, L., Tupler, R., and Siciliano, G. (2012).

Rippling muscle disease and facioscapulohumeral dystrophy-

like phenotype in a patient carrying a heterozygous CAV3

T78M mutation and a D4Z4 partial deletion: Further evidence

for ‘‘double trouble’’ overlapping syndromes. Neuromuscul.

Disord., in press. Published online January 13, 2012.

10.1016/j.nmd.2011.12.001.

45. Griggs, R.C., Tawil, R., McDermott, M., Forrester, J., Figlewicz,

D., and Weiffenbach, B.; FSH-DY Group. (1995). Monozygotic

twins with facioscapulohumeral dystrophy (FSHD): implica-

tions for genotype/phenotype correlation. Muscle Nerve 2,

S50–S55.

46. Tupler, R., Barbierato, L., Memmi, M., Sewry, C.A., De Grandis,

D., Maraschio, P., Tiepolo, L., and Ferlini, A. (1998). Identical

de novo mutation at the D4F104S1 locus in monozygotic

male twins affected by facioscapulohumeral muscular

dystrophy (FSHD) with different clinical expression. J. Med.

Genet. 35, 778–783.

ARTICLE

Combined Analysis of Genome-wide Association Studies for Crohn Disease and Psoriasis Identifies Seven Shared Susceptibility Loci

David Ellinghaus,1 Eva Ellinghaus,1 Rajan P. Nair,2 Philip E. Stuart,2 Tonu Esko,3,4 Andres Metspalu,3,4

Sophie Debrus,5 John V. Raelson,6 Trilokraj Tejasvi,2 Majid Belouchi,7 Sarah L. West,8 Jonathan N. Barker,8

Sulev Koks,9 Kulli Kingo,10 Tobias Balschun,1 Orazio Palmieri,11 Vito Annese,11,12 Christian Gieger,13

H. Erich Wichmann,14,15,16 Michael Kabesch,17 Richard C. Trembath,8 Christopher G. Mathew,8

Goncalo R. Abecasis,18 Stephan Weidinger,19 Susanna Nikolaus,20,21 Stefan Schreiber,1,21 James T. Elder,2,22

Michael Weichenthal,19 Michael Nothnagel,23,24 and Andre Franke1,24,*

Psoriasis (PS) and Crohn disease (CD) have been shown to be epidemiologically, pathologically, and therapeutically connected, but little

is known about their shared genetic causes. We performed meta-analyses of five published genome-wide association studies on PS (2,529

cases and 4,955 controls) and CD (2,142 cases and 5,505 controls), followed up 20 loci that showed strongest evidence for shared disease

association and, furthermore, tested cross-disease associations for previously reported PS and CD risk alleles in additional 6,115 PS cases,

4,073 CD cases, and 10,100 controls. We identified seven susceptibility loci outside the human leukocyte antigen region (9p24 near

JAK2, 10q22 at ZMIZ1, 11q13 near PRDX5, 16p13 near SOCS1, 17q21 at STAT3, 19p13 near FUT2, and 22q11 at YDJC) shared between

PS and CD with genome-wide significance (p < 5 × 10⁻⁸) and confirmed four already established PS and CD risk loci (IL23R, IL12B, REL,

and TYK2). Three of the shared loci are also genome-wide significantly associated with PS alone (10q22 at ZMIZ1, p_rs1250544 = 3.53 × 10⁻⁸; 11q13 near PRDX5, p_rs694739 = 3.71 × 10⁻⁹; 22q11 at YDJC, p_rs181359 = 8.02 × 10⁻¹⁰). In addition, we identified one

susceptibility locus for CD (16p13 near SOCS1, p_rs4780355 = 4.99 × 10⁻⁸). Refinement of association signals identified shared

genome-wide significant associations for exonic SNPs at 10q22 (ZMIZ1) and in silico expression quantitative trait locus analyses revealed that

the associations at ZMIZ1 and near SOCS1 have a potential functional effect on gene expression. Our results show the usefulness of joint

analyses of clinically distinct immune-mediated diseases and enlarge the map of shared genetic risk loci.

Introduction

Psoriasis (PS [MIM 177900]) and Crohn disease (CD [MIM

266600]) are both chronic inflammatory epithelial disorders

that are triggered by an activated cellular immune system

and have an estimated sibling relative risk (λs) of 4–111,2

and 25–42,3 respectively, and a prevalence of 2%–3% and

about 0.1%, respectively, in populations of European

ancestry.4,5 PS is a common hyperproliferative disorder of

the skin, characterized by red scaly plaques, typically occurring

on the elbows, knees, scalp, and lower back.6 In contrast,

CD is primarily a gut disorder affecting any aspect of the

gastrointestinal tract but with extraintestinal manifestations

which might also affect the skin (e.g., erythema nodosum

and pyoderma gangrenosum).7 CD results from the interaction

of environmental factors, including the commensal

microflora, with host immune mechanisms in a genetically

susceptible host.8 Although PS and CD are clinically

distinct diseases, they are observed together more frequently

than expected by chance, which could indicate shared

genetic factors acting in the etiology of both diseases.9–11

Recently, several genome-wide association studies (GWASs)

have successfully been carried out separately for CD

and PS12–20 and identified shared susceptibility genes,

such as IL23R (MIM 607562), IL12B (MIM 161561),

REL (MIM 164910), and TYK2 (MIM 176941), thereby

providing further evidence for a genetic overlap of

both diseases.13,16,21–23 One of the best characterized

risk loci for both CD and PS is IL23R, located in a drug-targetable

pathway.24 IL-23, a pro-inflammatory cytokine, is

1Institute of Clinical Molecular Biology, Christian-Albrechts-University, 24105 Kiel, Germany; 2Department of Dermatology, University of Michigan, Ann

Arbor, MI 48109, USA; 3Estonian Genome Center, University of Tartu, 50409 Tartu, Estonia; 4Institute of Molecular and Cell Biology, University of Tartu,

50409 Tartu, Estonia; 5Gatineau, QC J9J 2X6, Canada; 6PGX-Services, Montreal, QC H2T 1S1, Canada; 7Genizon BioSciences, Inc., St. Laurent, QC H4T

2C7, Canada; 8Division of Genetics and Molecular Medicine, King’s College London, London SE1 9RT, UK; 9Department of Physiology, Centre of Trans-

lational Medicine and Centre of Translational Genomics, University of Tartu, 50409 Tartu, Estonia; 10Department of Dermatology and Venerology, Univer-

sity of Tartu, 50409 Tartu, Estonia; 11Division of Gastroenterology, Istituto di Ricovero e Cura a Carattere Scientifico-Casa Sollievo della Sofferenza Hospital,

San Giovanni Rotondo 71013, Italy; 12Unit of Gastroenterology SOD2, Azienda Ospedaliero Universitaria Careggi, Florence 50134, Italy; 13Institute of

Genetic Epidemiology, Helmholtz Centre Munich, German Research Center for Environmental Health, 85764 Neuherberg, Germany; 14Institute of

Epidemiology I, Helmholtz Centre Munich, German Research Center for Environmental Health, 85764 Neuherberg, Germany; 15Institute of Medical

Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University, 81377 Munich, Germany; 16Klinikum Grosshadern, 81377 Munich, Germany; 17Department of Paediatric Pneumology, Allergy and Neonatology, Hannover Medical School, 30625 Hannover, Germany; 18Department of Biostatistics,

Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA; 19Department of Dermatology, Allergology, and Venerology, University

Hospital Schleswig-Holstein, Christian-Albrechts-University, 24105 Kiel, Germany; 20PopGen Biobank, Christian-Albrechts-University Kiel, 24105 Kiel,

Germany; 21Department of General Internal Medicine, University Hospital Schleswig-Holstein, 24105 Kiel, Germany; 22Ann Arbor Veterans Affairs Hospital,

Ann Arbor, MI 48105, USA; 23Institute of Medical Informatics and Statistics, Christian-Albrechts University, 24105 Kiel, Germany

24These authors contributed equally to this work

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.020. ©2012 by The American Society of Human Genetics. All rights reserved.

636 The American Journal of Human Genetics 90, 636–647, April 6, 2012

thought to be a key player driving autoimmunity in

human disease.25 A recent functional characterization of

the amino acid substitution R381Q in IL23R suggests that

IL-23-induced Th17 cell effector function is reduced in

protective allele carriers and leads to protection against

several autoimmune diseases, including PS, CD, and

ankylosing spondylitis.26

So far, shared susceptibility loci for CD and PS have been

identified by single-disease GWAS for CD or PS separately,

and established risk SNPs for one disease are usually tested

for association in another disease,27–31 rather than in a

combined systematic approach. Combined GWASs were

only conducted across clinically related phenotypes, such

as CD and ulcerative colitis (UC),32 CD and sarcoidosis

(SA)27 or CD and celiac disease (CelD).33 Recently, Zherna-

kova et al.34 performed a meta-analysis with a similar

systematic approach that combined genome-wide geno-

type data from two autoimmune diseases affecting dif-

ferent organs, namely CelD and rheumatoid arthritis

(RA). They identified eight shared risk loci outside the

human leukocyte antigen (HLA) region for CelD and RA,

four of them previously not known to be associated with

either CelD or RA. Based on genome-wide SNP data for

CelD and RA, the authors identified an increased proba-

bility for CelD risk SNPs to confer also an increase in risk

for RA and vice versa. Zhernakova et al.34 postulated

criteria for declaring a SNP as being a shared risk factor

for two clinically distinct diseases, namely that shared

SNPs (1) have to reach genome-wide significance in the

combined analysis of the initial GWAS screening stage

and the replication stage of the two distinct diseases

(p_GWAS+Repl < 5 × 10⁻⁸) and (2) have to achieve, for each

disease separately, nominal significance in the replication

stage (p_Repl < 0.05) as well as p_GWAS+Repl < 10⁻³ in the

combined analysis of screening and replication stage.
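
The two criteria can be written as a simple decision rule. The sketch below is illustrative only: the thresholds are those stated above, whereas the function name, the argument layout, and the example p values are hypothetical and do not come from either study.

```python
GENOME_WIDE_THRESHOLD = 5e-8    # criterion 1: combined GWAS + replication, both diseases
NOMINAL_THRESHOLD = 0.05        # criterion 2: replication stage, each disease
PER_DISEASE_THRESHOLD = 1e-3    # criterion 2: GWAS + replication, each disease


def is_shared_risk_snp(p_combined, per_disease):
    """Apply the two criteria for declaring a SNP a shared risk factor.

    p_combined  -- p value of the combined analysis across both diseases
    per_disease -- mapping: disease name -> (p_repl, p_gwas_plus_repl)
    """
    if not p_combined < GENOME_WIDE_THRESHOLD:
        return False
    return all(
        p_repl < NOMINAL_THRESHOLD and p_gwas_repl < PER_DISEASE_THRESHOLD
        for p_repl, p_gwas_repl in per_disease.values()
    )


# Hypothetical p values, for illustration only.
example = {"CD": (0.004, 2e-5), "PS": (0.01, 6e-4)}
print(is_shared_risk_snp(3e-9, example))
```

A SNP that is genome-wide significant in the combined analysis but fails either per-disease condition is rejected, which guards against a signal driven by one disease alone.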

We used the criteria proposed by Zhernakova et al.34 in

a genome-wide association analysis combining CD and

PS to systematically identify shared risk loci associated

with both diseases. A two-fold strategy was employed: in

a first approach (OVERLAP), we tested established non-

HLA CD risk SNPs for association with PS and vice versa,

thereby seeking confirmation of whether known risk loci

for one disease also play a role in the etiology of the other

disease, disregarding the direction of effect. In a second

approach (COMBINED), we performed a meta-analysis for

the combined phenotype based on genome-wide data

sets of both CD and PS in order to increase power for the

detection of new shared risk alleles because of an increased

sample size. The latter approach allows consideration of

same-direction as well as opposing-direction allelic effects

of putative shared markers between CD and PS through

the use of suitable allele coding. We followed up 20 loci

that showed the strongest association in the COMBINED

approach and that had not been previously reported

as being risk factors for either CD or PS. Follow-up

was performed in independent replication panels from

Germany, Estonia, Italy, United Kingdom, and the United

States (see Table S1, available online).
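
One way to picture how allele coding lets a combined analysis capture both same-direction and opposing-direction effects is an inverse-variance fixed-effect meta-analysis run under two codings. This is a sketch under assumptions: this passage does not specify the statistical model actually used, so the fixed-effect method, the function names, and the input effect sizes below are all illustrative.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance fixed-effect meta-analysis of log odds ratios.

    Returns the pooled effect, its standard error, and a two-sided p value
    from the normal approximation.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, p_value


def combined_phenotype_test(beta_cd, se_cd, beta_ps, se_ps):
    """Pool CD and PS effects under two allele codings (assumed scheme).

    'same_direction' keeps the effect allele identical in both diseases;
    'opposing_direction' flips the sign of the PS effect, which is
    equivalent to swapping the coded allele for that disease.
    """
    return {
        "same_direction": fixed_effect_meta([beta_cd, beta_ps], [se_cd, se_ps]),
        "opposing_direction": fixed_effect_meta([beta_cd, -beta_ps], [se_cd, se_ps]),
    }


# Hypothetical SNP with a risk effect in CD and a protective effect in PS.
result = combined_phenotype_test(0.2, 0.05, -0.2, 0.05)
for coding, (beta, se, p) in result.items():
    print(f"{coding}: beta={beta:.3f}, se={se:.3f}, p={p:.2e}")
```

With opposing effects of equal size, the same-direction coding cancels the signal while the flipped coding recovers it. Testing both codings adds a multiple-testing burden that a real analysis would need to account for.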

Subjects and Methods

Study Subjects

We analyzed a collection of different data sets. Figure 1 details the

different panels and their use in this study. For discovery, we

combined genome-wide case-control data of single-nucleotide

polymorphisms (SNP) for psoriasis (PS; panel A) and Crohn disease

(CD; panel B) (see Table S1) respectively, conducted genome-wide

meta-analyses on PS and CD, respectively, and employed two

strategies (COMBINED and OVERLAP, see below) to systematically

identify risk loci associated with both diseases. Replication was

Figure 1. Study Design for the Combined Analysis of CD and PS

For discovery, we conducted a PS and a CD GWAS meta-analysis (panel A and B in Table S1), respectively, and employed two strategies (OVERLAP and COMBINED) to systematically search for shared risk loci. In a first approach (OVERLAP), we tested established non-HLA PS risk SNPs for potential association (p < 0.01) with CD and vice versa. In a second approach (COMBINED), we selected SNPs from 20 loci for being nominally associated in each of the single-disease meta-analyses (p_panel A < 0.05, p_panel B < 0.05) and for being significantly associated in the combined-phenotype association analysis at the 10⁻⁴ level (p_panel A&B < 10⁻⁴). For replication, follow-up SNP genotyping was performed for CD and PS in independent replication panels (panels C–E in Table S1). The following abbreviations are used: PS-GER, German PS GWAS; PS-US, United States PS GWAS; PS-Canada, Canadian PS GWAS; CD-GER, German CD GWAS; and CD-UK, United Kingdom CD GWAS. For each panel, numbers of cases/controls are displayed in parentheses.

performed in independent replication panels from Germany (see

panels C–E), Estonia (panels C and D), Italy (panel E), United

Kingdom (panels C and E), and the United States (panel C) (see

Table S1). Written, informed consent was obtained from all study

participants and all protocols were approved by the institutional

ethical review committees of the participating centers.

Initial GWAS and German Replication Data

All German CD patients in discovery and replication panels (B and

C–E, respectively) were recruited either at the Department of

General Internal Medicine, Christian-Albrechts-University, Kiel,

and the Charite University Hospital, Berlin; through local outpa-

tient services; or nationwide with the support of the German

Crohn and Colitis Foundation. German PS cases in discovery

and replication panels (A and C and D, respectively) were recruited

either at the Department of Dermatology, Christian-Albrechts-

University, Kiel, or the Department of Dermatology and Allergy,

Technical University, Munich, or through local outpatient

services. Individuals were considered to be affected by PS if chronic

plaque or guttate psoriasis lesions covered more than 1% of the

total body surface area or if at least two skin, scalp, nail, or joint

lesions were clinically diagnosed as psoriasis. The 4,680 German

healthy control individuals in discovery and replication panels

(A–E) were obtained from the Popgen biobank.35 The additional

3,391 German healthy controls (after quality control measures)

in the discovery panels (A and B) were selected from the KORA

S3+S4 survey, an independent population-based sample from

the general population living in the region of Augsburg, southern

Germany.36 Another 674 German healthy controls in the

discovery panels (A and B) were selected from the ISAAC Phase II

study.37 German GWAS controls of the discovery phase were

randomly assigned to panels A and B at equal proportions, while

ensuring that controls in panel A did not overlap with German

GWAS controls used in the independent genome-wide meta-anal-

ysis on CD.13 The Collaborative Association Study of Psoriasis

(CASP) samples14 consisted of 1,303 PS cases and 1,322 controls

after quality control measures and are part of panel A in our study.

The data sets used for the analyses described in this manuscript

were obtained from the database of Genotype and Phenotype

(dbGaP). The genotyping of samples was provided through the

Genetic Association Information Network (GAIN).38 The Canadian

samples (from Genizon BioSciences and M. Belouchi, unpublished

data) consisted of 757 PS cases and 987 controls sampled

from the Quebec founder population (QFP) after quality control

measures. They are part of panel A. Membership in the QFP was

defined as having four grandparents with French-Canadian family

names who were born in the Province of Quebec, Canada, or in

adjacent areas in the provinces of New Brunswick and Ontario

or in New England or New York state. This criterion assured that

all subjects were descendants of French-Canadians living before

the 1960s, after which time admixture with non-French-Cana-

dians became more common. CD cases and controls from the

United Kingdom were recruited from the 1958 birth cohort and

UK National Blood Service for the Wellcome Trust Case Control

Consortium (WTCCC) (described in detail in WTCCC28). The

WTCCC1 CD samples consisted of 1,662 CD cases and 2,860

healthy controls after quality control procedures and entered the

analysis as part of panel B.

Additional Replication Data

A number of collaborative data sets were used as replication panels.

The Estonian samples used in the OVERLAP and COMBINED

approaches (part of panels C and D) were collected at the Depart-

ment of Dermatology and Venerology and at the Department of

Physiology and Centre of Translational Medicine at the University

of Tartu.

Additional Estonian samples (part of panel C) used for replica-

tion in the OVERLAP approach consisted of samples provided by

the population-based biobank of the Estonian Genome Center,

University of Tartu. Subjects were recruited by general practi-

tioners (GP) and physicians in the hospitals. Participants in the

hospitals were randomly selected from individuals visiting GP

offices or hospitals. Diagnosis of PS on the basis of clinical symptoms

was made by a general practitioner and confirmed by

a dermatologist. At the moment of recruitment, the controls did

not report diagnosis of osteoarthritis, psoriasis, or autoimmune

diseases. The United States samples (part of panel C) used for repli-

cation in the OVERLAP approach consisted of 2,137 PS cases and

1,903 controls of white European ancestry from the United States.

The Italian samples (panel E) used in the COMBINED approach

consisted of 688 CD cases and 879 healthy controls that were

used in the independent genome-wide meta-analysis on CD.13

The psoriasis data set from the United Kingdom consisted of

2,178 PS cases collected through the Genetic Analysis of Psoriasis

Consortium (GAPC) and 2,657 controls from the WTCCC2

common control set, used as the GWAS discovery set described

in Strange et al. in 2010.16 Only controls that did not overlap

with WTCCC1 controls were used. UK cases and controls entered

the analysis as part of panels C and E.

Quality Control and Genome-wide Genotype Imputation

Quality control (QC) was performed for each sample set separately.

In each sample set samples with more than 5% missing data

were excluded before genotype imputation. We also excluded

individuals from each pair of unexpected duplicates or relatives,

as well as outlier individuals with average marker heterozygosities

of ±5 standard deviations away from the sample mean. The remaining

samples were tested for population stratification with the

principal components stratification method as implemented in

EIGENSTRAT,39 and population outliers were subsequently

excluded. SNPs that had more than 5% missing data, a minor allele

frequency less than 1% or deviated from Hardy-Weinberg equilibrium

(exact p < 10⁻⁴ in controls) per sample set were excluded

with the PLINK software version 1.07.40 SNP imputation was carried

out with the BEAGLE v.3.1.1 software package (ref. 41) and 690 HapMap3

reference haplotypes from the CEU, TSI, MEX, and GIH cohorts42 to

predict missing autosomal genotypes in silico. We subsequently

analyzed only those SNPs that could be imputed with moderate

confidence (INFO score r² > 0.3) and had a minor allele frequency

more than 1% in cases or in controls. To take imputation uncer-

tainty into account, phenotypic association was tested for allele

dosage data separately for each of the five GWAS data sets in panels

A and B through the use of PLINK’s logistic regression framework for

dosage data. To control potentially confounding effects due to pop-

ulation stratification, we adjusted for the top ten eigenvectors from

EIGENSTRAT in the regression analysis. The genomic inflation

factor λ is defined as the ratio of the medians of the sample χ² test

statistics and the 1 degree of freedom χ² distribution (0.455).43

Because the estimated genomic inflation factor λ scales with sample

size, it is informative to report the inflation factor for an equivalent

study of 1,000 cases and 1,000 controls (λ1000) by rescaling λ.44
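The two quantities defined above can be illustrated with a minimal Python sketch. The median ratio follows the definition in the text. The rescaling formula for λ1000 is not spelled out here, so the version below is an assumption: it uses the commonly applied linear rescaling by the harmonic sample-size terms. Function names are illustrative only.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom (value quoted in the text).
CHI2_1DF_MEDIAN = 0.455


def genomic_inflation(chi2_stats):
    """Return lambda: median of the observed 1-df chi-square statistics divided by 0.455."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def lambda_1000(lam, n_cases, n_controls):
    """Rescale lambda to an equivalent study of 1,000 cases and 1,000 controls.

    Assumed rescaling (not given in the text):
    lambda_1000 = 1 + (lambda - 1) * (1/n_cases + 1/n_controls) / (1/1000 + 1/1000)
    """
    scale = (1.0 / n_cases + 1.0 / n_controls) / (1.0 / 1000 + 1.0 / 1000)
    return 1.0 + (lam - 1.0) * scale
```

Under this assumed formula, a study larger than 1,000 cases and 1,000 controls has its excess inflation (λ − 1) shrunk toward zero, which is why λ1000 values are comparable across panels of different size.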

Meta-Analyses

Meta-analyses were performed with PLINK’s meta-analysis function

and with its standard error of odds ratio weighting option


(inverse variance weighting), which implicitly deals with imputa-

tion uncertainty. For the combined-phenotype analysis, we per-

formed two sorts of meta-analysis in order to detect associations

of SNPs with either the same or opposite allelic effects in the two

diseases. For the same-effect analysis, the meta-analysis was carried

out as usual. For the opposite-effect analysis, first we flipped minor

and major alleles of each biallelic SNP in the CD data sets to mimic

an opposite-direction effect of the allele in CD and performed

a meta-analysis afterward. For both effects models, we considered

only those SNPs whose genotypes were available from at least

four out of the five GWAS data sets in panels A and B.
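The same-effect and opposite-effect analyses can be sketched in Python as follows. This is not PLINK's implementation; it is a minimal fixed-effect inverse-variance-weighted combination of per-data-set log odds ratios, under the assumption that swapping minor and major alleles in a data set is equivalent to reversing the sign of its log odds ratio. All function and argument names are hypothetical.

```python
import math


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis of log odds ratios.

    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p value from the normal distribution
    return pooled_beta, pooled_se, z, p


def combined_phenotype_meta(ps_results, cd_results, opposite=False):
    """Combine PS and CD data sets for one SNP.

    ps_results, cd_results: lists of (log_or, se) tuples, one per GWAS data set.
    With opposite=True the sign of every CD log odds ratio is reversed, which
    mimics flipping the minor and major alleles in the CD data sets.
    """
    sign = -1.0 if opposite else 1.0
    betas = [b for b, _ in ps_results] + [sign * b for b, _ in cd_results]
    ses = [s for _, s in ps_results] + [s for _, s in cd_results]
    return ivw_meta(betas, ses)
```

An allele that raises risk in one disease and lowers it in the other cancels out in the same-effect analysis but reinforces itself once the CD effects are sign-flipped, which is the motivation for running both models.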

Follow-Up Genotyping

Genotyping was carried out with our Sequenom iPlex platform

from Sequenom and TaqMan technology from Applied

Biosystems. Individuals with more than 3% missing data were

removed. SNPs that had more than 3% missing data, a minor allele

frequency less than 1% or deviated from Hardy-Weinberg

equilibrium (exact p < 10⁻⁴ in controls) per sample set were

excluded. p values for allele-based tests of phenotypic association

for each single replication sample set (panels C–E) were calculated

with PLINK. PLINK’s meta-analysis function was used to

obtain p values for the replication data set (pRepl) and for the

combined discovery-replication data set (pGWAS+Repl).
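The SNP-level exclusion rules for the follow-up genotyping can be written as a simple filter. The sketch below is illustrative only: it assumes that the missing-data rate, minor allele frequency, and exact Hardy-Weinberg p value in controls have already been computed (for example by PLINK), and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnpSummary:
    """Per-SNP summary values; field names are illustrative, not from the paper."""
    snp_id: str
    missing_rate: float     # fraction of missing genotype calls
    maf: float              # minor allele frequency
    hwe_p_controls: float   # exact Hardy-Weinberg p value in controls


def passes_replication_qc(snp, max_missing=0.03, min_maf=0.01, min_hwe_p=1e-4):
    """Apply the three SNP-level exclusion rules described for follow-up genotyping."""
    if snp.missing_rate > max_missing:
        return False  # more than 3% missing data
    if snp.maf < min_maf:
        return False  # minor allele frequency below 1%
    if snp.hwe_p_controls < min_hwe_p:
        return False  # deviation from Hardy-Weinberg equilibrium in controls
    return True
```

Setting `max_missing=0.05` gives the corresponding SNP-level thresholds used before imputation in the GWAS data sets.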

Regional Imputation Based on the 1000 Genomes Project Reference

To enable imputation based on the 1000 Genomes project

data, SNP positions referring to NCBI build 36 were mapped to build

37. SNP imputation was carried out with the BEAGLE software

package v.3.1.1 (ref. 41) and 566 EUR (European) haplotypes generated

by the 1000 Genomes Project.45 We analyzed only imputed SNPs

with moderate imputation confidence (INFO score r² > 0.3) and

a minor allele frequency more than 1% in cases or in controls.

Gene Relationships across Implicated Loci Pathway Analysis

The Gene Relationships Across Implicated Loci (GRAIL) software46

quantifies functional similarity between genes by applying

established statistical text mining methods to the PubMed

database of published scientific abstracts. As input we used

the following list of SNPs: rs2201841, rs2082412, rs702873,

rs12720356, rs10758669, rs694739, rs281379, rs181359,

rs4780355, rs744166, and rs1250544. GRAIL was run with the

following settings: HapMap release = HapMap release 22/hg18;

HapMap population = CEU (Utah residents with ancestry from

northern and western Europe from the Centre d’Etude du

Polymorphisme Humain collection); functional data source = PubMed

Text (April 2011); and gene size correction = on. GRAIL output

results were visualized with VIZ-GRAIL.

Results

Preparation of Single-Disease Meta-Analyses for

Discovery Phase by Means of HapMap3 Imputation

The overall study workflow for the combined analysis of

CD and PS is displayed in Figure 1. For discovery, we con-

ducted a meta-analysis on PS comprising 2,529 PS cases

and 4,955 controls from three previously published

GWASs,14,15 all of European descent (panel A in Table

S1). SNP data were combined with genotype imputation

based on the HapMap3 reference. We subsequently used

standard meta-analysis methodology (see Subjects and

Methods). In total, 1,121,166 quality-controlled

autosomal-imputed SNP markers were available for the analysis

on PS. To control for potential population stratification, we

adjusted association test statistics by means of principal

component analysis (PCA) (see Subjects and Methods). A

quantile-quantile (Q-Q) plot of the meta-analysis revealed

a marked excess of significant associations in the tail of

the distribution (Figure S1A), which is primarily due to

thousands of highly significant association signals from

the HLA region. Genetic heterogeneity was low; there

was an estimated genomic inflation factor of λ1000 = 1.027 (refs. 43 and 44; see Subjects and Methods). Results of the

meta-analysis on PS are summarized in Figure S2A.

In the same way, we performed a meta-analysis on CD by

using 1,034,639 quality-controlled autosomal-imputed

markers from a German GWAS13 and a previously published

UK GWAS,28 consisting of 2,142 CD cases and 5,505

controls in total (panel B in Table S1). Again, we observed

low genomic inflation (λ1000 = 1.032, Figure S1B). Results

of the meta-analysis on CD are summarized in Figure S2B.

OVERLAP Approach: Cross-Disease Analysis of

Established Risk SNPs

So far, four established GWAS risk loci that are located

outside the HLA region and shared between CD and PS

have been reported in the literature, namely IL23R, IL12B,

REL, and TYK2 (Table 1). Although the markers that showed

the strongest association differed between the two diseases

at each of the first three loci, we found the same SNP

rs12720356 at TYK2 to be associated with both CD and

PS. To seek confirmation of whether known risk loci for

CD also play a role in the etiology of PS and vice versa, we

checked whether markers that were significant (p < 0.01)

in our PS meta-analysis were among the 71 established

risk SNPs previously implicated in CD13 and whether significant

markers (p < 0.01) from our CD meta-analysis were

among the 25 established PS risk SNPs.14–20 The four already

established shared risk SNPs (Table 1) were excluded. Given

the well-established and heterogeneous allelic associations

of the HLA region on chromosome 6 with both CD and

PS, we also excluded all markers from the extended HLA

region (chr6:25-34 Mb). Although none of the known PS

SNPs were significantly associated with CD in our analysis,

five out of the 71 known CD risk SNPs met our criterion

of significance, namely rs10758669 (JAK2 [MIM

147796]), rs694739 (PRDX5 [MIM 606583]), rs281379

(FUT2 [MIM 182100]), rs744166 (STAT3 [MIM 102582]),

and rs181359 (YDJC; HGNC 27158). We genotyped these

five SNPs by using TaqMan technology in a large indepen-

dent replication panel comprising 3,937 PS cases and

4,847 controls but also used summary statistics data of

the five SNPs from an independent GWAS on PS16

comprising 2,178 PS cases and 2,657 controls. The overall

replication panel consisted of 6,115 PS cases and 7,504


controls (panel C, Table S1). We performed single-marker

association tests for panel C (pPS-Repl) and conducted

a meta-analysis (pPS-GWAS+Repl) by combining association

results from the GWAS (pPS-GWAS) and the replication

(pPS-Repl) stages (Table 2). All of the CD risk SNPs were

also significantly associated with PS at the previously

proposed level34 of pPS-Repl < 0.05 and pPS-GWAS+Repl <

10⁻³ (rs10758669 near JAK2, rs694739 near PRDX5,

rs281379 near FUT2, rs181359 at YDJC, and rs744166

at STAT3). These five SNPs have already been reported

to show significant association at the genome-wide level

(pCD-GWAS+Repl < 5 × 10⁻⁸, pCD-Repl < 0.05, and

pCD-GWAS+Repl < 10⁻³) in a very large, independent

genome-wide meta-analysis on CD roughly three times

the size of this one, that is, comprising 6,333 CD cases

and 15,056 controls.13 All five SNPs achieved genome-wide

significance in the combined analysis of PS discovery

panel A, PS replication panel C, and CD discovery data

from Franke et al.13 (pCDPS-GWAS+Repl < 5 × 10⁻⁸). Furthermore,

SNP rs694739, 7.9 kb downstream of PRDX5, as well

as SNP rs181359, 53.7 kb downstream of YDJC, reached

genome-wide significance for PS only (prs694739 = 3.71 ×

10⁻⁰⁹ and prs181359 = 8.02 × 10⁻¹⁰). We also observed

a highly significant association at the FUT2 locus

(prs281379 = 7.86 × 10⁻⁰⁸) for PS only.

COMBINED Approach, Part 1: Meta-Analysis

Considering Same-Direction Effects

In order to identify additional shared genetic susceptibility

loci in CD and PS, we performed a meta-analysis of the

combined phenotype where CD and PS were considered

as a single phenotype. The disease-specific meta-analyses

(panels A and B) were merged to form a combined-pheno-

type meta-analysis discovery panel comprising 2,142 CD

cases, 2,529 PS cases, and 10,460 healthy controls. In total,

1,123,777 quality-controlled autosomal markers were

available for the analysis in at least four out of the five

GWAS data sets. As with the OVERLAP approach, we

excluded all markers from the extended HLA region

(chr6:25-34Mb), leaving 1,116,213 autosomal SNPs for

screening of shared risk loci. We observed only low

genomic inflation for the same-direction meta-analysis

(l1000 ¼ 1.023; Figure S3A). After exclusion of established

loci for PS and CD, the inflation factors further decreased

(Figure S3C). To provide proof of principle for our

approach, we first examined association signals at the three

established shared risk loci with same-direction effects of

alleles for CD and PS (see Table 1). We observed highly

significant association signals for all three loci (pIL23R = 1.82 × 10⁻²², pIL12B = 3.32 × 10⁻⁷, and pREL = 1.53 ×

10⁻⁷; Figures S4A–S4C). Subsequently, we selected SNPs

for being nominally associated in each of the single-disease

meta-analyses (pCD-GWAS < 0.05, pPS-GWAS < 0.05) and for

being significantly associated in the combined-phenotype

association analysis at the 10⁻⁴ level (pCDPS-GWAS < 1 ×

10⁻⁴). This resulted in 17 SNPs located at 17 distinct loci.

Except for the four known shared loci (see Table 1), we

did not exclude established risk loci from either CD or PS

to maintain the chance of detecting shared risk alleles at

these loci. Because one of the 17 SNPs is located at 10q22

(ZMIZ1 [MIM 607159]), which is an established CD risk

locus, we added the established CD-associated SNP

rs1250550 from this region to the list of follow-up SNPs.

We then genotyped these 18 SNPs in an independent panel

of 1,713 CD cases, 1,009 PS cases and 3,565 controls (panel

D, Table S1D) by using the Sequenom iPlex platform.

Association results for all 18 SNPs are shown in Table S2.

The strongest association was observed at ZMIZ1 for SNP

rs1250544 (pCDPS-GWAS = 1.12 × 10⁻⁵ and pCDPS-GWAS+Repl = 2.66 × 10⁻¹⁰; see Table S2 and Figure S5A) and yielded

genome-wide significance in the same-effect combined-phenotype

analysis of discovery panels A and B and replication

panel D. SNP rs1250544 also reached genome-wide

significance for PS alone (pPS-GWAS+Repl = 3.90 × 10⁻⁸). A

robust association with CD, but not with PS, was observed

at SOCS1 (MIM 603597) on chromosomal region 16p13

(pCDPS-GWAS = 9.36 × 10⁻⁷, pCD-GWAS = 1.47 × 10⁻³, and

pCD-GWAS+Repl = 1.01 × 10⁻⁷ for rs4780355; see Table S2

and Figure S5B). We further corroborated both association

signals by genotyping SNPs rs1250544 and rs4780355 in

additional sample sets from Germany and Italy, comprising

2,360 CD cases and 1,015 healthy controls, but we also

Table 1. Established Risk Loci Shared between CD and PS from Published Studies on CD and PS, Respectively

Locus | Genes of Interest(a) | Top CD dbSNP ID(b) | Risk Allele | OR | Top PS dbSNP ID(b) | Risk Allele | OR | LD between CD and PS SNPs (D′/r²) | Comment
1p31 | IL23R | rs11209026(c) | G | 2.66 | rs2201841(f) | G | 1.13 | 1.0/0.018 | different markers
5q33 | IL12B | rs6556412(d) | A | 1.18 | rs2082412(f) | G | 1.44 | 0.715/0.269 | different markers
2p16 | REL | rs10181042(e) | T | 1.14 | rs702873(g) | G | 1.12 | 0.036/0.001 | different markers
19p13 | TYK2 | rs12720356(e) | G | 1.12 | rs12720356(g) | T | 1.40 | same marker | same marker, opposite direction of effect

(a) Candidate genes of interest are listed for the locus.
(b) Lead SNP with most significant association within a locus, as stated in the reference publication.
(c) See Franke et al.13 and Duerr et al.21
(d) See Barrett et al.12 and Franke et al.13
(e) See Franke et al.13
(f) See Cargill et al.22 and Nair et al.23
(g) See Strange et al.16

used summary statistics data of the two SNPs from the inde-

pendent GWAS on PS16 comprising 2,178 PS cases and

2,657 controls (panel E in Table S1, the same cases and

controls from the United Kingdom, as described in panel

C). In the combined analysis of discovery panels A

and B and replication panels D and E (Tables 3 and 4),

SNP rs4780355 achieved genome-wide significance

(pCDPS-GWAS+Repl = 1.37 × 10⁻¹³) but also attained

genome-wide significance for CD alone (pCD-GWAS+Repl = 4.99 × 10⁻⁸).

COMBINED Approach, Part 2: Meta-Analysis

Assuming Opposite-Direction Effects

An allele might confer a risk for CD while protecting

against PS and vice versa, as is the case for TYK2. Therefore,

we also screened our combined-phenotype meta-analysis

data (panels A and B) while coding alleles in such a way

as to consider the opposite effects of them in the two

diseases (see Subjects and Methods). We observed low

genomic inflation for the opposite-direction meta-analysis

(λ1000 = 1.009, Figure S3B). After excluding established

shared loci for PS and CD, the inflation factors further

decreased (Figure S3D). In a first step, we checked the

known risk SNP rs12720356 (TYK2; see Table 1) for oppo-

site direction of effects. SNP rs12720356 had a p value of

4.09 × 10⁻⁵ in the combined analysis of panels A and B

(Figure S4D); there was an odds ratio (OR) of 1.29 (95%

confidence interval [CI] [1.10,1.51]) for allele A in panel A

(pPS-GWAS = 1.39 × 10⁻³) and of 0.78 (95% CI [0.65,0.94])

in panel B (pCD-GWAS = 1.01 × 10⁻²). We then selected three

SNPs for subsequent genotyping (with Sequenom) and

testing in replication panel D (see Table S1D). The selection

criteria were the same as for the same-direction effect meta-

analysis. However, none of the three SNPs replicated in

both diseases at p value < 0.05 (see Table S3).
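The selection rule shared by both COMBINED screens (nominal association in each single-disease meta-analysis plus p < 10⁻⁴ in the combined-phenotype analysis) can be sketched as a short Python filter. The dictionary keys are illustrative. The p values for rs1250544 and rs4780355 are the discovery-stage values reported in Table 3; the third entry is a made-up example that fails the CD criterion.

```python
def select_follow_up(snps, nominal=0.05, combined=1e-4):
    """Return the IDs of SNPs that satisfy all three discovery-stage thresholds.

    snps: iterable of dicts with keys 'snp', 'p_cd', 'p_ps', 'p_combined'.
    """
    return [
        s["snp"]
        for s in snps
        if s["p_cd"] < nominal and s["p_ps"] < nominal and s["p_combined"] < combined
    ]


candidates = [
    {"snp": "rs1250544", "p_cd": 3.31e-2, "p_ps": 3.85e-5, "p_combined": 1.12e-5},
    {"snp": "rs4780355", "p_cd": 1.47e-3, "p_ps": 1.72e-4, "p_combined": 9.36e-7},
    # Hypothetical SNP: strong in PS and in the combined analysis, but not nominal in CD.
    {"snp": "rs_hypothetical", "p_cd": 0.20, "p_ps": 1.0e-6, "p_combined": 5.0e-6},
]

selected = select_follow_up(candidates)  # ['rs1250544', 'rs4780355']
```

Requiring nominal significance in each disease separately guards against a combined signal that is driven entirely by one of the two phenotypes.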

In Silico Fine-Mapping: Refinement of Association

Signals of COMBINED Approach

For refinement of the association signals at ZMIZ1 and

SOCS1, we imputed a region of about ±1 Mb around the

strongest signals from the discovery panels A and B (see

Table S1) by using the EUR reference from the 1000

Genomes Project45 (see Subjects and Methods). In silico

fine-mapping of the region around SOCS1 via standard

meta-analysis methodology (see Subjects and Methods)

confirmed rs4780355 to be highly significant in this region

(pGWAS = 4.04 × 10⁻⁷). Additionally, another SNP,

rs2021511, which is located in the same intron as

rs4780355 (2.9 kb downstream of rs4780355) showed

the same magnitude of association (pGWAS = 1.58 × 10⁻⁷;

Figure 2) but was not selected for further replication

because of the high linkage disequilibrium (LD) between

SNPs rs4780355 and rs2021511 (r² = 0.934) according to

the 1000 Genomes Project EUR reference. Screening of

the imputed region of SOCS1 for coding SNPs revealed

one missense SNP with p < 10⁻⁴ within the TNP2 gene,

namely rs11640138. We genotyped this SNP in replication

Table 2. Association Results of OVERLAP Approach from Cross-Disease Comparison of Established Risk Markers

Chr | SNP | A1 | Locus | CD GWAS Meta-Analysis(a) (6,333/15,056) p | OR | PS GWAS (2,529/4,955) p | OR | PS Replication (6,115/7,504) p | OR | PS GWAS and Repl (8,644/12,459) p | OR | CD GWAS Meta-Analysis(a) + PS GWAS and Repl (14,977/27,515) p | OR | Status Now(b)
9 | rs10758669 | C | JAK2 | 1.0 × 10⁻¹³ | 1.18 | 2.43 × 10⁻⁰³ | 1.13 | 2.47 × 10⁻⁰³ | 1.08 | 2.69 × 10⁻⁰⁵ | 1.10 | 1.30 × 10⁻¹⁶ | 1.14 | CD-PS, CD
11 | rs694739 | G | PRDX5 | 3.4 × 10⁻⁰⁷ | 0.89 | 1.13 × 10⁻⁰⁴ | 0.86 | 6.12 × 10⁻⁰⁶ | 0.89 | 3.71 × 10⁻⁰⁹ | 0.88 | 2.41 × 10⁻¹⁴ | 0.89 | PS, CD-PS, CD
19 | rs281379 | A | FUT2 | 8.6 × 10⁻¹⁰ | 1.13 | 3.22 × 10⁻⁰³ | 1.13 | 7.12 × 10⁻⁰⁶ | 1.13 | 7.86 × 10⁻⁰⁸ | 1.12 | 1.32 × 10⁻¹⁷ | 1.13 | CD-PS, CD
22 | rs181359 | G | YDJC | 6.3 × 10⁻¹³ | 0.83 | 4.83 × 10⁻⁰³ | 0.88 | 3.54 × 10⁻⁰⁸ | 0.84 | 8.02 × 10⁻¹⁰ | 0.85 | 1.33 × 10⁻²¹ | 0.84 | PS, CD-PS, CD
17 | rs744166(c) | A | STAT3 | 1.1 × 10⁻⁰⁷ | 1.13 | 2.44 × 10⁻⁰⁴ | 0.87 | 1.49 × 10⁻⁰² | 0.94 | 5.30 × 10⁻⁰⁵ | 0.92 | 5.48 × 10⁻¹¹ | 0.90 | CD-PS, CD

The following abbreviations are used: Chr, chromosome of marker; SNP, rs ID; A1, minor allele; Locus, one candidate gene in the region; p/OR, p value and corresponding odds ratio with respect to minor allele for the large GWAS meta-analysis of CD (ref. 13), GWAS meta-analysis of PS (panel A), PS replication analysis (panel C), combined analysis of PS GWAS meta-analysis (panel A) and PS replication (panel C), and combined analysis of CD GWAS meta-analysis (ref. 13) and panels A and C. For each panel, numbers of cases/controls are displayed in parentheses.
(a) See Franke et al.13
(b) Status now: new status of association with CD and/or PS. All SNPs are established CD risk SNPs with p < 5 × 10⁻⁸ (ref. 13) that were significant (p < 0.01) in our PS meta-analysis. All SNPs, except for rs744166, showed the same direction of effect for CD and PS. None of the SNPs showed an exact Hardy-Weinberg p value < 0.01 in the PS replication (panel C).
(c) Minor and major alleles of rs744166 were flipped in the CD GWAS meta-analysis in order to calculate the combined-phenotype p value and odds ratio.

panel D, but it did not replicate in either disease at

p value < 0.05.

In silico fine-mapping of ZMIZ1 with the same method-

ology narrowed down the association signal to two coding

SNPs, namely rs1250559 (pCDPS-GWAS = 1.53 × 10⁻⁷,

Figure 2) and rs1250560 (pCDPS-GWAS = 3.13 × 10⁻⁶).

According to the 1000 Genomes Project EUR reference,

both SNPs are in near perfect LD (r² = 0.948). Depending

on different splice variants of ZMIZ1, rs1250559 is either

intronic or located in the 3′-untranslated region (3′-UTR),

whereas rs1250560 is either an intronic SNP or a missense

SNP located in exon 5. The intronic ZMIZ1 SNP rs1250544,

which yielded the strongest signal from the initial

same-effect combined-phenotype meta-analysis, and the

missense SNP rs1250560 are 20.6 kb apart and in moderate

LD (r² = 0.682). In order to substantiate our findings from

the in silico analyses, we genotyped both ZMIZ1 SNPs in

replication panels D and E (see Table S1). As shown in

Tables 3 and 4, both SNPs were associated with PS at the

0.05 level and even showed genome-wide significance with

CD (pCD-Repl = 8.06 × 10⁻¹⁰ at rs1250560, pCD-Repl = 4.10 × 10⁻¹⁰ at rs1250559). Interestingly, the association

signals of these two SNPs were much stronger in the initial

analysis of PS panel A than of CD panel B. The combined

analysis of discovery panels A and B and replication panels

D and E yielded genome-wide significance for rs1250560

(pCDPS-GWAS+Repl = 7.34 × 10⁻¹⁶) and rs1250559

(pCDPS-GWAS+Repl = 2.78 × 10⁻¹⁶), both of which are of

higher significance than was observed for rs1250544

(pCDPS-GWAS+Repl = 7.32 × 10⁻¹⁴) (Tables 3 and 4).
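The LD measures quoted in this section (r²) and in Table 1 (D′/r²) follow the standard two-locus definitions. As an illustration only, the sketch below computes them from haplotype and allele frequencies; the input frequencies in the usage note are hypothetical and are not taken from the 1000 Genomes reference.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D_prime, r2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a, p_b: frequencies of alleles A and B (both must lie strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2
```

For example, `ld_stats(0.02, 0.02, 0.5)` describes a rare allele that always occurs on the background of a common one: D′ is 1 while r² is only about 0.02. This is the same qualitative pattern as the IL23R entry in Table 1 (D′ = 1.0, r² = 0.018), and it shows why two markers at one locus can be in complete LD by D′ yet tag each other poorly.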

Effect on Gene Expression

We subsequently assessed a potential functional effect of

the four SNPs showing association for both CD and PS

with the same direction of effects, namely rs1250544,

rs1250559, rs1250560 (ZMIZ1), and rs4780355 (near

SOCS1). To this end, we investigated the correlation of

SNP genotypes with gene expression levels by means of

in silico expression quantitative trait locus (eQTL) analysis

by using the mRNA by SNP Browser software.47 This program

utilizes genotype data from 408,273 SNPs and

gene expression data from Epstein-Barr-virus-transformed

lymphoblastoid cell lines that were collected from 400 chil-

dren and measured with the Affymetrix HG-U133 Plus 2.0

chip. Significant evidence (uncorrected pExpression < 10⁻⁴)

for causing differential expression of ZMIZ1 was observed

for SNP rs1250546 (prs1250546 = 8.10 × 10⁻⁵), which is in

high LD with our lead SNP rs1250544 (r² = 0.829). Also,

we found even stronger evidence for association

between expression of C16ORF75 (MIM 612426), which is

Table 3. Association Results of Combined-Phenotype Meta-Analysis Considering Same-Direction Effects of Alleles from COMBINED Approach

Chr | SNP | A1 | Locus | CD+PS Discovery GWAS (4,671/10,460) p | OR | PS GWAS (2,529/4,955) p | OR | CD GWAS (2,142/5,505) p | OR | PS Replication (3,187/4,759) p | OR | CD Replication (4,073/2,478) p | OR
10 | rs1250544(a) | G | ZMIZ1 | 1.12 × 10⁻⁰⁵ | 1.13 | 3.85 × 10⁻⁰⁵ | 1.18 | 3.31 × 10⁻⁰² | 1.09 | 1.94 × 10⁻⁰⁴ | 1.14 | 5.28 × 10⁻⁰⁷ | 1.22
10 | rs1250560(b) | A | ZMIZ1 | 3.13 × 10⁻⁰⁶ | 0.87 | 4.31 × 10⁻⁰⁶ | 0.82 | 4.87 × 10⁻⁰² | 0.92 | 1.06 × 10⁻⁰³ | 0.87 | 8.06 × 10⁻¹⁰ | 0.79
10 | rs1250559(b) | A | ZMIZ1 | 1.53 × 10⁻⁰⁷ | 0.87 | 4.58 × 10⁻⁰⁶ | 0.82 | 3.46 × 10⁻⁰² | 0.91 | 1.16 × 10⁻⁰³ | 0.87 | 4.10 × 10⁻¹⁰ | 0.79
16 | rs4780355(a) | T | SOCS1 | 9.36 × 10⁻⁰⁷ | 1.16 | 1.72 × 10⁻⁰⁴ | 1.18 | 1.47 × 10⁻⁰³ | 1.15 | 7.53 × 10⁻⁰⁴ | 1.14 | 7.40 × 10⁻⁰⁶ | 1.19

The following abbreviations are used: Chr: chromosome of marker; SNP: rs ID; A1: minor allele; Locus: one candidate gene in the region; p/OR: p value and corresponding odds ratio with respect to minor allele for the combined-phenotype GWAS meta-analysis of CD and PS (panels A and B), GWAS meta-analysis of PS (panel A), GWAS meta-analysis of CD (panel B), PS replication analysis (as part of panel D), CD replication analysis (part of panel D, panel E). For each panel, numbers of cases/controls are displayed in parentheses. None of the SNPs showed an exact Hardy-Weinberg p value < 0.01 in the PS and CD replication panels (panels D and E).
(a) SNPs were identified via genotype imputation based on the HapMap3 reference and p/OR are given according to that analysis.
(b) SNPs were identified via genotype imputation based on the 1000 Genomes reference and p/OR are given according to that analysis.

Table 4. Association Results of Combined-Phenotype Meta-Analysis Considering Same-Direction Effects of Alleles from COMBINED Approach

Chr | SNP | A1 | Locus | PS GWAS and Repl (5,716/9,714) p | OR | CD GWAS and Repl (6,215/7,983) p | OR | CD+PS GWAS and Repl (11,931/17,697) p | OR | Status Now
10 | rs1250544(a) | G | ZMIZ1 | 3.53 × 10⁻⁰⁸ | 1.16 | 2.56 × 10⁻⁰⁷ | 1.16 | 7.32 × 10⁻¹⁴ | 1.16 | PS, CD-PS, CD
10 | rs1250560(b) | A | ZMIZ1 | 3.03 × 10⁻⁰⁷ | 0.84 | 4.10 × 10⁻⁰⁹ | 0.84 | 7.34 × 10⁻¹⁶ | 0.85 | CD-PS, CD
10 | rs1250559(b) | A | ZMIZ1 | 3.63 × 10⁻⁰⁷ | 0.84 | 1.24 × 10⁻⁰⁹ | 0.84 | 2.78 × 10⁻¹⁶ | 0.85 | CD-PS, CD
16 | rs4780355(a) | T | SOCS1 | 5.30 × 10⁻⁰⁷ | 1.15 | 4.99 × 10⁻⁰⁸ | 1.17 | 1.37 × 10⁻¹³ | 1.16 | CD-PS, CD

For abbreviations used, see Table 3. Combined analysis of PS GWAS meta-analysis (panel A) and PS replication (part of panel D), combined analysis of CD GWAS meta-analysis (panel B) and CD replication (part of panel D, panel E), combined analysis of CD+PS GWAS meta-analysis (panels A and B) and CD+PS replication (panels D and E).
(a) SNPs were identified via genotype imputation based on the HapMap3 reference and p/OR are given according to that analysis.
(b) SNPs were identified via genotype imputation based on the 1000 Genomes reference and p/OR are given according to that analysis.

located 90 kb upstream of SOCS1, and SNP rs243323

(prs243323 = 1.10 × 10⁻⁸). This SNP is also in high LD with

our lead SNP rs4780355 (r² = 0.931). Both proxy SNPs

rs1250546 and rs243323 were also significantly associated

in our same-effect combined-phenotype analysis of discovery

panels A and B (pGWAS = 7.15 × 10⁻⁵ for rs1250546,

pGWAS = 1.80 × 10⁻⁶ for rs243323). This in silico eQTL

analysis supports the notion that our four reported SNPs

might affect the expression of ZMIZ1 and C16ORF75. The

full list of significant associations between SNP genotypes

and gene expression levels is shown in Table S4.

Discussion

In a large combined sample set of 6,215 CD cases, 8,644 PS

cases and 20,560 healthy controls, we have identified

seven non-HLA susceptibility loci shared between CD

and PS (9p24 near JAK2, 10q22 at ZMIZ1, 11q13 near

PRDX5, 16p13 near SOCS1, 19p13 near FUT2, 17q21 at

Figure 2. Regional Association Plots of In Silico Fine-Mapping for Newly Detected Shared Risk Loci from COMBINED Approach

Shared risk loci for CD and PS at (A) 10q22 (ZMIZ1) and (B) 16p13 (near SOCS1). eQTL analyses revealed a potential effect of the associations at ZMIZ1 and near SOCS1 on gene expression. p values (−log10 p) are depicted with regard to the physical location of markers and are based on imputed genotypes. SNP genotypes were imputed with the EUR reference from 1000 Genomes Project45 (see Subjects and Methods). The following abbreviations are used: blue-filled circle, lead SNP of the combined-phenotype data (panels A and B); other filled circles, analyzed SNPs of the combined-phenotype data (panels A and B) where the fill color corresponds to the strength of linkage disequilibrium (r²) with the lead SNP (for color coding see legend in the upper right corner of each plot); green triangles, analyzed SNPs of the meta-analysis on PS (panel A); gray squares, analyzed SNPs of the meta-analysis on CD (panel B); and blue line, recombination intensity (cM/Mb). Positions and gene annotations are according to NCBI’s build 37 (hg19).

STAT3, 22q11 at YDJC). These loci, except

for SOCS1, were already known to play a

role in CD etiology, but were of unknown

significance for PS13 (see also Table S5 for

associations with other diseases). Notably,

three of these loci showed genome-wide

significance when tested for association

with PS alone (10q22 at ZMIZ1, 11q13

near PRDX5, and 22q11 at YDJC). Further-

more, we revealed a risk locus for CD

(16p13 near SOCS1). The identified shared

risk loci point to functionally very inter-

esting genes that might play a role in the

pathogenesis of both CD and PS. The gene

ZMIZ1 (also known as hZIMP10 or TRAFIP10) encodes for

the protein zinc finger MIZ type 1, which is a member of

the protein inhibitor of activated STAT (PIAS) family. The

protein regulates the activity of several transcription

factors such as the androgen receptor, Smad3/4, and p53;

regulates TGF-β/SMAD signaling; and is induced by

retinoic acid.48 FUT2 encodes α-(1,2)-fucosyltransferase

(FUT2), an enzyme that regulates expression of

the Lewis human blood group of antigens on the surface

of epithelial cells and in body fluids. Genetic variants in

FUT2 have been implicated in susceptibility to infections

with Norovirus49 and Helicobacter pylori.50 PRDX5 encodes

Peroxiredoxin-5, which belongs to the peroxiredoxin

family of antioxidant enzymes that reduce hydrogen

peroxide and alkyl hydroperoxides and might play a

protective role during inflammatory processes. SOCS1

encodes the suppressor of cytokine signaling 1 (SOCS1),

a protein that is a member of the STAT-induced STAT inhibitor

(SSI) family, also known as the suppressor of cytokine signaling

(SOCS) family. SOCS1 is a cytokine-inducible negative

regulator of cytokine signaling.51,52 Cytokines such as IL2,

IL3, erythropoietin, and interferon-gamma can induce

expression of SOCS1.53 Moreover, a potential functional

effect of the associations at ZMIZ1 and near SOCS1 on

gene expression was found by an in silico eQTL analysis.

The present study has increased the number of known

shared CD and PS susceptibility loci to eleven (IL12B,

IL23R, REL, TYK2, JAK2, ZMIZ1, PRDX5, SOCS1, STAT3,

FUT2, and YDJC). To quantify the degree of relatedness

between genes within the eleven loci, we used a published

statistical genomics method, namely GRAIL (gene relationships

across implicated loci),46 that applies statistical text

mining to PubMed abstracts (see Subjects and Methods

and Figure 3). GRAIL highlights a number of nonrandom

and evidence-based connections between the genes within

the eleven loci that might indicate overlap in the pathways

acting in the etiology of CD and PS. Multiple genes

(IL12B, IL23R, TYK2, JAK2, SOCS1, and STAT3) are

involved in IL23/Th17 signaling and play a critical role

in the principal signaling mechanism for a wide array of

cytokines and growth factors. It is noteworthy that genes

CDC37 ([MIM 605065] 21.7 kb downstream of the estab-

lished shared risk locus at TYK2) and STIP1 ([MIM

605063] 113.5 kb upstream of the identified shared risk locus

at PRDX5) were found by GRAIL to be significantly con-

nected. CDC37 and STIP1 encode CDC37 and STI1, respec-

tively, two of several auxiliary proteins that associate with

the heat-shock protein 90 (HSP90) molecular chaperone

and thus are collectively referred to as HSP90 cochaper-

ones.54 HSP90 itself is an abundant, evolutionarily con-

served molecular chaperone that acts mainly as a cofactor

for the folding of polyproteins into functional, stable,

mature proteins, and it physically associates with JAK1

and probably JAK2,55 demonstrating that JAK1/2 are

client proteins of HSP90. A study in mice and in patient

samples suggested that HSP90 inhibitors might help treat

Figure 3. Gene Relationships across the 11 Shared Risk Loci of CD and PS Identified by GRAIL Analysis
GRAIL46 is a statistical text-mining approach to quantify the degree of relatedness among genes in genomic disease regions. It estimates the statistical significance of the number of observed relationships with a null model in which relationships between the genes occur by random chance. A significance score p_text, which is adjusted for multiple hypothesis testing, represents the output GRAIL score. p_text values approximately estimate type-I error rates. Outer circle: lead SNPs from shared risk loci of both diseases; each box represents a SNP. Inner circle: genes of the genomic regions around lead SNPs that were identified based on LD properties; each box represents a gene; genes that were scored at p_text < 0.05 are significantly linked to genes in the other disease regions and are indicated in bold type. Lines: the lines between genes represent significant connections, with the thickness and redness of the lines being inversely proportional to the probability that a text-based connection would be seen by chance.

JAK2-dependent myeloproliferative neoplasms (MPNs).56

Moreover, inhibition of HSP90 was found to block Nod2-

mediated activation of the transcription factor NF-κB and

reduce NALP3-mediated gout-like inflammation in

mice,57 and mutations in the gene encoding NALP3, a

member of the Nod-like receptor (NLR) protein family,

are associated with several autoinflammatory disor-

ders.58,59 Our hypothesis that CDC37 and STIP1 are poten-

tial joint risk factors for CD and PS is substantiated by an

association peak within CDC37 in our same-effect com-

bined-phenotype analysis of discovery panels A and B

(p_GWAS = 1.61 × 10⁻³ for rs11879191, Figure S6).

It is worth noting that we used a two-tier strategy (the OVERLAP

and COMBINED approaches) to identify shared disease risk loci. Both approaches turned out to

be effective and complementary tools for gaining insights

into the postulated shared pathogenesis of CD and PS.

Application of only a single strategy would have decreased

the number of identified loci. Although the OVERLAP

approach represents a simple and cost-effective strategy

(cross-disease comparison of known risk SNPs), the

COMBINED approach provides the power to identify

shared susceptibility loci even if association signals are

heterogeneous between diseases, that is, when the particular

SNP showing the smallest p value at the considered locus

differs between the diseases, as was the case, for example, for

the identified risk locus at ZMIZ1. This heterogeneity of the

most strongly associated SNPs could be due to interactions

with other genetic variants or environmental factors, to

differences in the distribution or effect size of causal alleles,

or to the fact that the identified SNPs show an association signal

only because they are in LD with the actual causal variant. In

particular, the increase of power due to increased sample

sizes makes the COMBINED approach a potentially power-

ful tool to detect shared risk loci that might be missed in

disease-specific GWASs that are often underpowered

because of their comparatively smaller sample sizes.
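
The gain in power from pooling cases can be illustrated with a rough calculation. The following Python sketch approximates the power of a two-sided allelic z-test at the genome-wide threshold; it is not the analysis performed in this study. The function name `allelic_test_power` and the allele frequencies are hypothetical, the sample sizes are the totals quoted at the start of this Discussion and serve only as convenient inputs, and the calculation assumes the same risk-allele frequency in both case groups (a same-effect scenario).

```python
from math import sqrt
from statistics import NormalDist


def allelic_test_power(freq_cases, freq_controls, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided z-test comparing allele frequencies.

    Each individual contributes two alleles, so the allele counts are
    2 * n_cases and 2 * n_controls. Uses a normal approximation with
    unpooled variances; this is an illustrative simplification.
    """
    std_normal = NormalDist()
    standard_error = sqrt(
        freq_cases * (1 - freq_cases) / (2 * n_cases)
        + freq_controls * (1 - freq_controls) / (2 * n_controls)
    )
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    z_effect = abs(freq_cases - freq_controls) / standard_error
    # Probability that the test statistic falls in either rejection tail.
    return std_normal.cdf(z_effect - z_critical) + std_normal.cdf(-z_effect - z_critical)


# Hypothetical risk-allele frequencies (assumed for illustration only).
FREQ_CASES, FREQ_CONTROLS = 0.32, 0.30

# Sample sizes quoted in the Discussion: 6,215 CD cases, 8,644 PS cases,
# and 20,560 healthy controls.
power_cd_only = allelic_test_power(FREQ_CASES, FREQ_CONTROLS, 6215, 20560)
power_combined = allelic_test_power(FREQ_CASES, FREQ_CONTROLS, 6215 + 8644, 20560)

print(f"CD cases only:  {power_cd_only:.3f}")
print(f"CD + PS cases:  {power_combined:.3f}")
```

Under these assumptions the combined case group yields a smaller standard error and therefore higher power than either disease alone; if the allele had opposite or strongly different effects in the two diseases, pooling would not help in this way.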

Because we did not search for loci harboring association

signals with different and independent SNPs in terms of

LD associated with CD and PS, there is room for improve-

ment. For instance, in a simple rank approach with regard

to single-marker association p values, different disease-associated

markers for the same locus could be determined

when they rank high with regard to their p value in associ-

ation scans of CD and PS, respectively. This would allow

detecting shared susceptibility loci even if association

signals are heterogeneous between diseases. An approach

to meet the challenge of the heterogeneity of genetic effects

of the same markers between different diseases was

proposed by Morris et al.60 The authors developed a test of

association within a multinomial regression framework

and demonstrated the improved power of their multino-

mial regression-based analysis over existing methods.
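
The simple rank approach suggested above could be sketched as follows. This is only an illustration under stated assumptions: the function names, the locus labels, the SNP identifiers, the p values, and the `top_k` cutoff are all hypothetical, and no such analysis was run in this study.

```python
from typing import Dict, List, Tuple

# One scan row is (snp_id, locus, p_value); all values below are made up.
Scan = List[Tuple[str, str, float]]


def best_rank_per_locus(scan: Scan) -> Dict[str, Tuple[int, str]]:
    """Rank SNPs by ascending p value and keep the best-ranked SNP per locus."""
    ordered = sorted(scan, key=lambda row: row[2])
    best: Dict[str, Tuple[int, str]] = {}
    for rank, (snp, locus, _p_value) in enumerate(ordered, start=1):
        if locus not in best:  # first hit per locus is its top-ranked SNP
            best[locus] = (rank, snp)
    return best


def shared_loci_by_rank(scan_a: Scan, scan_b: Scan, top_k: int) -> Dict[str, Tuple[str, str]]:
    """Return loci whose best SNP ranks within top_k in both scans.

    The two top SNPs may differ, which is the point of the approach:
    the locus is shared even if the association signals are heterogeneous.
    """
    best_a = best_rank_per_locus(scan_a)
    best_b = best_rank_per_locus(scan_b)
    shared: Dict[str, Tuple[str, str]] = {}
    for locus in sorted(best_a.keys() & best_b.keys()):
        rank_a, snp_a = best_a[locus]
        rank_b, snp_b = best_b[locus]
        if rank_a <= top_k and rank_b <= top_k:
            shared[locus] = (snp_a, snp_b)
    return shared


# Hypothetical single-marker results for two diseases.
cd_scan: Scan = [
    ("snpA1", "locus1", 1e-6),
    ("snpA2", "locus1", 0.2),
    ("snpB1", "locus2", 0.4),
    ("snpC1", "locus3", 3e-5),
]
ps_scan: Scan = [
    ("snpA2", "locus1", 2e-5),
    ("snpA1", "locus1", 0.3),
    ("snpB1", "locus2", 0.5),
    ("snpC1", "locus3", 0.6),
]

print(shared_loci_by_rank(cd_scan, ps_scan, top_k=2))
```

In this toy input, locus1 is reported as shared although a different SNP leads in each scan, whereas locus3 ranks high in only one of the two scans and is not reported. A real application would also need to account for LD between markers and for multiple testing.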

It is likely that future studies will identify additional

shared disease loci for CD and PS by further increasing

the sample size of analyzed case-control panels or, for

example, by applying the suggested rank approach.

Evidence for a shared etiological basis among several auto-

immune and inflammatory diseases is growing. For Crohn

disease, for example, Lees and colleagues recently reported

that 51 of the known 71 loci overlap with more

than 23 distinct diseases, comprising also several nonau-

toimmune conditions.61 Given the success of this study,

we expect the same for the investigation of further combi-

nations of such diseases for shared risk factors.

Supplemental Data

Supplemental Data include six figures and five tables and can be

found with this article online at http://www.cell.com/AJHG/.

Acknowledgments

We thank all individuals with psoriasis or CD, their families,

control individuals and clinicians for their participation in this

project. We thank the WTCCC consortium for the access to the

CD case/control data. We acknowledge the cooperation of Genizon

Biosciences. We wish to thank Tanja Wesse, Tanja Henke and Susan

Ehlers for expert technical help. We acknowledge EGCUT and

Estonian Biocentre personnel, especially Ms. M. Hass and Mr. V.

Soo. A list of funding sources is included in the Supplemental Data.

Received: October 28, 2011

Revised: January 30, 2012

Accepted: February 16, 2012

Published online: April 5, 2012

Web Resources

The URLs for data presented herein are as follows:

1000 Genomes Project, http://www.1000genomes.org/

BEAGLE, http://faculty.washington.edu/browning/beagle/beagle.html

dbGaP, http://www.ncbi.nlm.nih.gov/gap

EIGENSTRAT, http://genepath.med.harvard.edu/~reich/Software.htm

GRAIL, http://www.broadinstitute.org/mpg/grail/

Online Mendelian Inheritance in Man (OMIM), http://www.

omim.org

PLINK, http://pngu.mgh.harvard.edu/~purcell/plink/

PopGen Biobank, http://www.popgen.de

VIZ-GRAIL, http://www.broadinstitute.org/mpg/grail/vizgrail.html

References

1. Bhalerao, J., and Bowcock, A.M. (1998). The genetics of psori-

asis: A complex disorder of the skin and immune system.

Hum. Mol. Genet. 7, 1537–1545.

2. Elder, J.T., Nair, R.P., Guo, S.W., Henseler, T., Christophers, E.,

and Voorhees, J.J. (1994). The genetics of psoriasis. Arch. Der-

matol. 130, 216–224.

3. Russell, R.K., and Satsangi, J. (2004). IBD: A family affair. Best

Pract. Res. Clin. Gastroenterol. 18, 525–539.

4. Griffiths, C.E., and Barker, J.N. (2007). Pathogenesis and clin-

ical features of psoriasis. Lancet 370, 263–271.

5. Logan, I., and Bowlus, C.L. (2010). The geoepidemiology of autoimmune

intestinal diseases. Autoimmun. Rev. 9, A372–A378.

6. Sagoo, G.S., Cork, M.J., Patel, R., and Tazi-Ahnini, R. (2004).

Genome-wide studies of psoriasis susceptibility loci: A review.

J. Dermatol. Sci. 35, 171–179.

7. Najarian, D.J., and Gottlieb, A.B. (2003). Connections

between psoriasis and Crohn’s disease. J. Am. Acad. Dermatol.

48, 805–821, quiz 822–804.

8. Khor, B., Gardet, A., and Xavier, R.J. (2011). Genetics and pathogenesis

of inflammatory bowel disease. Nature 474, 307–317.

9. Yates, V.M., Watkinson, G., and Kelman, A. (1982). Further

evidence for an association between psoriasis, Crohn’s disease

and ulcerative colitis. Br. J. Dermatol. 106, 323–330.

10. Bernstein, C.N., Wajda, A., and Blanchard, J.F. (2005). The

clustering of other chronic inflammatory diseases in inflam-

matory bowel disease: A population-based study. Gastroenter-

ology 129, 827–836.

11. Weng, X., Liu, L., Barcellos, L.F., Allison, J.E., and Herrinton,

L.J. (2007). Clustering of inflammatory bowel disease with

immune mediated diseases among members of a northern california-managed

care organization. Am. J. Gastroenterol. 102,

1429–1435.

12. Barrett, J.C., Hansoul, S., Nicolae, D.L., Cho, J.H., Duerr, R.H.,

Rioux, J.D., Brant, S.R., Silverberg, M.S., Taylor, K.D., Barmada,

M.M., et al; NIDDK IBD Genetics Consortium; Belgian-French

IBD Consortium; Wellcome Trust Case Control Consortium.

(2008). Genome-wide association defines more than 30 distinct

susceptibility loci for Crohn’s disease. Nat. Genet. 40, 955–962.

13. Franke, A., McGovern, D.P., Barrett, J.C., Wang, K., Radford-

Smith, G.L., Ahmad, T., Lees, C.W., Balschun, T., Lee, J., Rob-

erts, R., et al. (2010). Genome-wide meta-analysis increases to

71 the number of confirmed Crohn’s disease susceptibility

loci. Nat. Genet. 42, 1118–1125.

14. Nair, R.P., Duffin, K.C., Helms, C., Ding, J., Stuart, P.E., Gold-

gar, D., Gudjonsson, J.E., Li, Y., Tejasvi, T., Feng, B.J., et al;

Collaborative Association Study of Psoriasis. (2009).

Genome-wide scan reveals association of psoriasis with IL-23

and NF-kappaB pathways. Nat. Genet. 41, 199–204.

15. Ellinghaus, E., Ellinghaus, D., Stuart, P.E., Nair, R.P., Debrus, S.,

Raelson, J.V., Belouchi, M., Fournier, H., Reinhard, C., Ding, J.,

et al. (2010). Genome-wide association study identifies a psori-

asis susceptibility locus at TRAF3IP2. Nat. Genet. 42, 991–995.

16. Strange, A., Capon, F., Spencer, C.C., Knight, J., Weale, M.E.,

Allen, M.H., Barton, A., Band, G., Bellenguez, C., Bergboer,

J.G., et al; Genetic Analysis of Psoriasis Consortium & the

Wellcome Trust Case Control Consortium 2. (2010). A

genome-wide association study identifies new psoriasis

susceptibility loci and an interaction between HLA-C and

ERAP1. Nat. Genet. 42, 985–990.

17. Sun, L.D., Cheng, H., Wang, Z.X., Zhang, A.P., Wang, P.G., Xu,

J.H., Zhu, Q.X., Zhou, H.S., Ellinghaus, E., Zhang, F.R., et al.

(2010). Association analyses identify six new psoriasis susceptibility

loci in the Chinese population. Nat. Genet. 42, 1005–1009.

18. Stuart, P.E., Nair, R.P., Ellinghaus, E., Ding, J., Tejasvi, T.,

Gudjonsson, J.E., Li, Y., Weidinger, S., Eberlein, B., Gieger,

C., et al. (2010). Genome-wide association analysis identifies

three psoriasis susceptibility loci. Nat. Genet. 42, 1000–1004.

19. Huffmeier, U., Uebe, S., Ekici, A.B., Bowes, J., Giardina, E., Ko-

rendowych, E., Juneblad, K., Apel, M., McManus, R., Ho, P.,

et al. (2010). Common variants at TRAF3IP2 are associated

with susceptibility to psoriatic arthritis and psoriasis. Nat.

Genet. 42, 996–999.

20. Zhang, X.J., Huang, W., Yang, S., Sun, L.D., Zhang, F.Y., Zhu,

Q.X., Zhang, F.R., Zhang, C., Du, W.H., Pu, X.M., et al.

(2009). Psoriasis genome-wide association study identifies

susceptibility variants within LCE gene cluster at 1q21. Nat.

Genet. 41, 205–210.

21. Duerr, R.H., Taylor, K.D., Brant, S.R., Rioux, J.D., Silverberg,

M.S., Daly, M.J., Steinhart, A.H., Abraham, C., Regueiro, M.,

Griffiths, A., et al. (2006). A genome-wide association study

identifies IL23R as an inflammatory bowel disease gene.

Science 314, 1461–1463.

22. Cargill, M., Schrodi, S.J., Chang, M., Garcia, V.E., Brandon, R.,

Callis, K.P., Matsunami, N., Ardlie, K.G., Civello, D., Catanese,

J.J., et al. (2007). A large-scale genetic association study

confirms IL12B and leads to the identification of IL23R as

psoriasis-risk genes. Am. J. Hum. Genet. 80, 273–290.

23. Nair, R.P., Ruether, A., Stuart, P.E., Jenisch, S., Tejasvi, T., Hiremagalore,

R., Schreiber, S., Kabelitz, D., Lim, H.W., Voorhees, J.J.,

et al. (2008). Polymorphisms of the IL12B and IL23R genes are

associated with psoriasis. J. Invest. Dermatol. 128, 1653–1661.

24. Mannon, P.J., Fuss, I.J., Mayer, L., Elson, C.O., Sandborn, W.J.,

Present, D., Dolin, B., Goodman, N., Groden, C., Hornung,

R.L., et al; Anti-IL-12 Crohn’s Disease Study Group. (2004).

Anti-interleukin-12 antibody for active Crohn’s disease. N.

Engl. J. Med. 351, 2069–2079.

25. Abraham, C., and Cho, J.H. (2009). IL-23 and autoimmunity:

New insights into the pathogenesis of inflammatory bowel

disease. Annu. Rev. Med. 60, 97–110.

26. Di Meglio, P., Di Cesare, A., Laggner, U., Chu, C.C., Napolitano,

L., Villanova, F., Tosi, I., Capon, F., Trembath, R.C., Peris, K., and

Nestle, F.O. (2011). The IL23R R381Q gene variant protects

against immune-mediated diseases by impairing IL-23-induced

Th17 effector response in humans. PLoS ONE 6, e17160.

27. Franke, A., Fischer, A., Nothnagel, M., Becker, C., Grabe, N.,

Till, A., Lu, T., Muller-Quernheim, J., Wittig, M., Hermann,

A., et al. (2008). Genome-wide association analysis in sarcoid-

osis and Crohn’s disease unravels a common susceptibility

locus on 10p12.2. Gastroenterology 135, 1207–1215.

28. Wellcome Trust Case Control Consortium. (2007). Genome-

wide association study of 14,000 cases of seven common

diseases and 3,000 shared controls. Nature 447, 661–678.

29. Franke, A., Balschun, T., Karlsen, T.H., Hedderich, J., May, S.,

Lu, T., Schuldt, D., Nikolaus, S., Rosenstiel, P., Krawczak, M.,

and Schreiber, S. (2008). Replication of signals from recent

studies of Crohn’s disease identifies previously unknown

disease loci for ulcerative colitis. Nat. Genet. 40, 713–715.

30. Wang, K., Baldassano, R., Zhang, H., Qu, H.Q., Imielinski, M.,

Kugathasan, S., Annese, V., Dubinsky, M., Rotter, J.I., Russell,

R.K., et al. (2010). Comparative genetic analysis of inflamma-

tory bowel disease and type 1 diabetes implicates multiple loci

with opposite effects. Hum. Mol. Genet. 19, 2059–2067.

31. Cotsapas, C., Voight, B.F., Rossin, E., Lage, K., Neale, B.M., Wal-

lace, C., Abecasis, G.R., Barrett, J.C., Behrens, T., Cho, J., et al;

FOCiS Network of Consortia. (2011). Pervasive sharing of

genetic effects in autoimmune disease. PLoS Genet. 7, e1002254.

32. Imielinski, M., Baldassano, R.N., Griffiths, A., Russell, R.K.,

Annese, V., Dubinsky, M., Kugathasan, S., Bradfield, J.P.,

Walters, T.D., Sleiman, P., et al; Western Regional Alliance for

Pediatric IBD; International IBD Genetics Consortium; NIDDK

IBD Genetics Consortium; Belgian-French IBD Consortium;

Wellcome Trust Case Control Consortium. (2009). Common

variants at five new loci associated with early-onset inflamma-

tory bowel disease. Nat. Genet. 41, 1335–1340.

33. Festen, E.A., Goyette, P., Green, T., Boucher, G., Beauchamp,

C., Trynka, G., Dubois, P.C., Lagace, C., Stokkers, P.C.,

Hommes, D.W., et al. (2011). A meta-analysis of genome-

wide association scans identifies IL18RAP, PTPN2, TAGAP,

and PUS10 as shared risk loci for Crohn’s disease and celiac

disease. PLoS Genet. 7, e1001283.

34. Zhernakova, A., Stahl, E.A., Trynka, G., Raychaudhuri, S.,

Festen, E.A., Franke, L., Westra, H.J., Fehrmann, R.S., Kurree-

man, F.A., Thomson, B., et al. (2011). Meta-analysis of

genome-wide association studies in celiac disease and rheu-

matoid arthritis identifies fourteen non-HLA shared loci.

PLoS Genet. 7, e1002004.

35. Krawczak, M., Nikolaus, S., von Eberstein, H., Croucher, P.J., El

Mokhtari, N.E., and Schreiber, S. (2006). PopGen: Population-

based recruitment of patients and controls for the analysis of

complex genotype-phenotype relationships. Community

Genet. 9, 55–61.

36. Wichmann, H.E., Gieger, C., and Illig, T.; MONICA/KORA

Study Group. (2005). KORA-gen—resource for population

genetics, controls and a broad spectrum of disease pheno-

types. Gesundheitswesen 67 (Suppl 1), S26–S30.

37. Weiland, S.K., Bjorksten, B., Brunekreef, B., Cookson, W.O.,

von Mutius, E., and Strachan, D.P.; International Study of

Asthma and Allergies in Childhood Phase II Study Group.

(2004). Phase II of the International Study of Asthma and

Allergies in Childhood (ISAAC II): Rationale and methods.

Eur. Respir. J. 24, 406–412.

38. Manolio, T.A., Rodriguez, L.L., Brooks, L., Abecasis, G., Ballin-

ger, D., Daly, M., Donnelly, P., Faraone, S.V., Frazer, K., Gabriel,

S., et al; GAIN Collaborative Research Group; Collaborative

Association Study of Psoriasis; International Multi-Center

ADHD Genetics Project; Molecular Genetics of Schizophrenia

Collaboration; Bipolar Genome Study; Major Depression

Stage 1 Genomewide Association in Population-Based Sam-

ples Study; Genetics of Kidneys in Diabetes (GoKinD) Study.

(2007). New models of collaboration in genome-wide associa-

tion studies: The Genetic Association Information Network.

Nat. Genet. 39, 1045–1051.

39. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E.,

Shadick, N.A., and Reich, D. (2006). Principal components

analysis corrects for stratification in genome-wide association

studies. Nat. Genet. 38, 904–909.

40. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira,

M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly,

M.J., and Sham, P.C. (2007). PLINK: A tool set for whole-

genome association and population-based linkage analyses.

Am. J. Hum. Genet. 81, 559–575.

41. Browning, B.L., and Browning, S.R. (2009). A unified approach

to genotype imputation and haplotype-phase inference for

large data sets of trios and unrelated individuals. Am. J.

Hum. Genet. 84, 210–223.

42. Altshuler, D.M., Gibbs, R.A., Peltonen, L., Altshuler, D.M.,

Gibbs, R.A., Peltonen, L., Dermitzakis, E., Schaffner, S.F., Yu,

F., Peltonen, L., et al; International HapMap 3 Consortium.

(2010). Integrating common and rare genetic variation in

diverse human populations. Nature 467, 52–58.

43. Devlin, B., and Roeder, K. (1999). Genomic control for associ-

ation studies. Biometrics 55, 997–1004.

44. de Bakker, P.I., Ferreira, M.A., Jia, X., Neale, B.M., Raychaudhuri,

S., and Voight, B.F. (2008). Practical aspects of imputation-driven

meta-analysis of genome-wide association studies.

Hum. Mol. Genet. 17 (R2), R122–R128.

45. 1000 Genomes Project Consortium. (2010). A map of human

genome variation from population-scale sequencing. Nature

467, 1061–1073.

46. Raychaudhuri, S., Plenge, R.M., Rossin, E.J., Ng, A.C., Purcell,

S.M., Sklar, P., Scolnick, E.M., Xavier, R.J., Altshuler, D., and

Daly, M.J.; International Schizophrenia Consortium. (2009).

Identifying relationships among genomic disease regions:

Predicting genes at pathogenic SNP associations and rare dele-

tions. PLoS Genet. 5, e1000534.

47. Dixon, A.L., Liang, L., Moffatt, M.F., Chen, W., Heath, S.,

Wong, K.C., Taylor, J., Burnett, E., Gut, I., Farrall, M., et al.

(2007). A genome-wide association study of global gene

expression. Nat. Genet. 39, 1202–1207.

48. Li, X., Thyssen, G., Beliakoff, J., and Sun, Z. (2006). The novel

PIAS-like protein hZimp10 enhances Smad transcriptional

activity. J. Biol. Chem. 281, 23748–23756.

49. Carlsson, B., Kindberg, E., Buesa, J., Rydell, G.E., Lidon, M.F.,

Montava, R., Abu Mallouh, R., Grahn, A., Rodríguez-Díaz, J.,

Bellido, J., et al. (2009). The G428A nonsense mutation

in FUT2 provides strong but not absolute protection against

symptomatic GII.4 Norovirus infection. PLoS ONE 4, e5593.

50. Ikehara, Y., Nishihara, S., Yasutomi, H., Kitamura, T., Matsuo, K.,

Shimizu, N., Inada, K., Kodera, Y., Yamamura, Y., Narimatsu, H.,

et al. (2001). Polymorphisms of two fucosyltransferase genes

(Lewis and Secretor genes) involving type I Lewis antigens are

associated with the presence of anti-Helicobacter pylori IgG

antibody. Cancer Epidemiol. Biomarkers Prev. 10, 971–977.

51. Starr, R., Willson, T.A., Viney, E.M., Murray, L.J., Rayner, J.R.,

Jenkins, B.J., Gonda, T.J., Alexander, W.S., Metcalf, D., Nicola,

N.A., and Hilton, D.J. (1997). A family of cytokine-inducible

inhibitors of signalling. Nature 387, 917–921.

52. Yasukawa, H., Sasaki, A., and Yoshimura, A. (2000). Negative

regulation of cytokine signaling pathways. Annu. Rev. Immu-

nol. 18, 143–164.

53. Krebs, D.L., and Hilton, D.J. (2000). SOCS: Physiological

suppressors of cytokine signaling. J. Cell Sci. 113, 2813–2819.

54. Abbas-Terki, T., Briand, P.A., Donze, O., and Picard, D. (2002).

The Hsp90 co-chaperones Cdc37 and Sti1 interact physically

and genetically. Biol. Chem. 383, 1335–1342.

55. Shang, L., and Tomasi, T.B. (2006). The heat shock protein 90-

CDC37 chaperone complex is required for signaling by types I

and II interferons. J. Biol. Chem. 281, 1876–1884.

56. Marubayashi, S., Koppikar, P., Taldone, T., Abdel-Wahab, O.,

West, N., Bhagwat, N., Caldas-Lopes, E., Ross, K.N., Gonen,

M., Gozman, A., et al. (2010). HSP90 is a therapeutic target

in JAK2-dependent myeloproliferative neoplasms in mice

and humans. J. Clin. Invest. 120, 3578–3593.

57. Mayor, A., Martinon, F., De Smedt, T., Petrilli, V., and Tschopp,

J. (2007). A crucial function of SGT1 and HSP90 in inflamma-

some activity links mammalian and plant innate immune

responses. Nat. Immunol. 8, 497–503.

58. Hoffman, H.M., Mueller, J.L., Broide, D.H., Wanderer, A.A., and

Kolodner, R.D. (2001). Mutation of a new gene encoding a putative

pyrin-like protein causes familial cold autoinflammatory

syndrome and Muckle-Wells syndrome. Nat. Genet. 29, 301–305.

59. Hawkins, P.N., Lachmann, H.J., Aganna, E., and McDermott, M.F.

(2004). Spectrum of clinical features in Muckle-Wells syndrome

and response to anakinra. Arthritis Rheum. 50, 607–612.

60. Morris, A.P., Lindgren, C.M., Zeggini, E., Timpson, N.J., Frayling,

T.M., Hattersley, A.T., and McCarthy, M.I. (2010). A powerful

approach to sub-phenotype analysis in population-based

genetic association studies. Genet. Epidemiol. 34, 335–343.

61. Lees, C.W., Barrett, J.C., Parkes, M., and Satsangi, J. (2011).

New IBD genetics: Common pathways with other diseases.

Gut 60, 1739–1753.

The American Journal of Human Genetics 90, 636–647, April 6, 2012 647

ARTICLE

Identification of IRF8, TMEM39A, and IKZF3-ZPBP2 as Susceptibility Loci for Systemic Lupus Erythematosus in a Large-Scale Multiracial Replication Study

Christopher J. Lessard,1,2 Indra Adrianto,1 John A. Ice,1 Graham B. Wiley,1 Jennifer A. Kelly,1

Stuart B. Glenn,1 Adam J. Adler,1 He Li,1,2 Astrid Rasmussen,1 Adrienne H. Williams,3 Julie Ziegler,3

Mary E. Comeau,3 Miranda Marion,3 Benjamin E. Wakeland,4 Chaoying Liang,4 Paula S. Ramos,5

Kiely M. Grundahl,1 Caroline J. Gallant,6 Marta E. Alarcon-Riquelme for the BIOLUPUS and GENLES Networks,1,7 Graciela S. Alarcon,8 Juan-Manuel Anaya,9 Sang-Cheol Bae,10 Susan A. Boackle,11

Elizabeth E. Brown,8 Deh-Ming Chang,12 Soo-Kyung Cho,10 Lindsey A. Criswell,13 Jeffrey C. Edberg,8

Barry I. Freedman,14 Gary S. Gilkeson,5 Chaim O. Jacob,15 Judith A. James,1,2,16 Diane L. Kamen,5

Robert P. Kimberly,8 Jae-Hoon Kim,10 Javier Martin,17 Joan T. Merrill,18 Timothy B. Niewold,19

So-Yeon Park,10 Michelle A. Petri,20 Bernardo A. Pons-Estel,21 Rosalind Ramsey-Goldman,22

John D. Reveille,23 R. Hal Scofield,1,2,16,24 Yeong Wook Song,25 Anne M. Stevens,26,27 Betty P. Tsao,28

Luis M. Vila,29 Timothy J. Vyse,30 Chack-Yung Yu,31,32 Joel M. Guthridge,1 Kenneth M. Kaufman,1,33,34

John B. Harley,33,34 Edward K. Wakeland,4 Carl D. Langefeld,3 Patrick M. Gaffney,1,2

Courtney G. Montgomery,1 and Kathy L. Moser1,2,*

Systemic lupus erythematosus (SLE) is a chronic heterogeneous autoimmune disorder characterized by the loss of tolerance to self-anti-

gens and dysregulated interferon responses. The etiology of SLE is complex, involving both heritable and environmental factors.

Candidate-gene studies and genome-wide association (GWA) scans have been successful in identifying new loci that contribute to

disease susceptibility; however, much of the heritable risk has yet to be identified. In this study, we sought to replicate 1,580 variants

showing suggestive association with SLE in a previously published GWA scan of European Americans; we tested a multiethnic

population consisting of 7,998 SLE cases and 7,492 controls of European, African American, Asian, Hispanic, Gullah, and Amerindian

ancestry to find association with the disease. Several genes relevant to immunological pathways showed association with SLE. Three

loci exceeded the genome-wide significance threshold: interferon regulatory factor 8 (IRF8; rs11644034; p_meta-Euro = 2.08 × 10⁻¹⁰),

transmembrane protein 39A (TMEM39A; rs1132200; p_meta-all = 8.62 × 10⁻⁹), and 17q21 (rs1453560; p_meta-all = 3.48 × 10⁻¹⁰)

between IKAROS family of zinc finger 3 (AIOLOS; IKZF3) and zona pellucida binding protein 2 (ZPBP2). Fine mapping, resequencing,

imputation, and haplotype analysis of IRF8 indicated that three independent effects tagged by rs8046526, rs450443, and rs4843869,

respectively, were required for risk in individuals of European ancestry. Eleven additional replicated effects (5 × 10⁻⁸ < p_meta-Euro <

9.99 × 10⁻⁵) were observed with CFHR1, CADM2, LOC730109/IL12A, LPP, LOC63920, SLU7, ADAMTSL1, C10orf64, OR8D4,

FAM19A2, and STXBP6. The results of this study increase the number of confirmed SLE risk loci and identify others warranting further

investigation.

1Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK 73104, USA; 2Department of

Pathology, University of Oklahoma Health Sciences Center, Oklahoma City, OK 73104, USA; 3Department of Biostatistical Sciences, Wake Forest University

Health Sciences, Winston-Salem, NC 27157, USA; 4Department of Immunology, University of Texas Southwestern Medical Center at Dallas, Dallas, TX

75390, USA; 5Division of Rheumatology and Immunology, Department of Medicine, Medical University of South Carolina, Charleston, SC 29425, USA; 6Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, Uppsala 75105, Sweden; 7Centro de Genomica e Investigaciones

Oncologicas, Pfizer-Universidad de Granada-Junta de Andalucía, Granada 18100, Spain; 8Division of Clinical Immunology and Rheumatology, Department

of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294, USA; 9Center for Autoimmune Diseases Research, Universidad del Rosario,

Bogota, Colombia; 10Department of Rheumatology, Hanyang University Hospital for Rheumatic Diseases, Seoul 133-792, Korea; 11Division of

Rheumatology, University of Colorado Denver, Aurora, CO 80045, USA; 12National Defense Medical Center, Taipei 114, Taiwan; 13Rosalind Russell Medical

Research Center for Arthritis, University of California, San Francisco, San Francisco, CA 94143, USA; 14Section on Nephrology, Department of Internal

Medicine, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA; 15Department of Medicine, University of Southern California, Los Angeles,

CA 90089, USA; 16Department of Medicine, University of Oklahoma Health Sciences Center, Oklahoma City, OK 73104, USA; 17Instituto de Parasitología

y Biomedicina Lopez-Neyra, Consejo Superior de Investigaciones Cientificas, Granada 18100, Spain; 18Clinical Pharmacology, Oklahoma Medical Research

Foundation, Oklahoma City, OK 73104, USA; 19Section of Rheumatology and Gwen Knapp Center for Lupus and Immunology Research, University of Chi-

cago, Chicago, IL 60637, USA; 20Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA; 21Sanatorio Parque,

Rosario 2000, Argentina; 22Division of Rheumatology, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA; 23Rheumatology and

Clinical Immunogenetics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA; 24US Department of Veterans Affairs Medical

Center, Oklahoma City, OK 73104, USA; 25Division of Rheumatology, Seoul National University, Seoul 110-799, Korea; 26Division of Rheumatology,

Department of Pediatrics, University of Washington, Seattle, WA 98105, USA; 27Center for Immunity and Immunotherapies, Seattle Children’s Research

Institute, Seattle, WA 98105, USA; 28Division of Rheumatology, Department of Medicine, University of California, Los Angeles, Los Angeles, CA 90095,

USA; 29Division of Rheumatology, Department of Medicine, University of Puerto Rico Medical Sciences Campus, San Juan 00936-5067, Puerto Rico; 30Division of Genetics and Molecular Medicine and Division of Immunology, Infection, and Inflammatory Disease, King’s College London, London SE1

9RT, UK; 31Center for Molecular and Human Genetics, The Research Institute, Nationwide Children’s Hospital, Columbus, OH 43205, USA; 32Department

of Pediatrics, Ohio State University, Columbus, OH 43205, USA; 33Division of Rheumatology, Cincinnati Children’s Hospital Medical Center, Cincinnati,

OH 45229, USA; 34US Department of Veterans Affairs Medical Center, Cincinnati, OH 45220, USA

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.023. ©2012 by The American Society of Human Genetics. All rights reserved.

648 The American Journal of Human Genetics 90, 648–660, April 6, 2012

Introduction

Systemic lupus erythematosus (SLE [MIM 152700]) is a

chronic autoimmune disease that is classically character-

ized by inflammation, dysregulated type 1 interferon

responses, and autoantibodies directed to the nuclear

compartment. Women of childbearing age are preferen-

tially affected at a rate nine times that of men, and those

of African American and Asian ancestries are affected

more frequently and manifest more severe disease than

those of European ancestry.1 Although the etiology of SLE

is largely unknown, its pathogenesis most likely involves

a complex interplay between environmental (e.g., UV light,

Epstein-Barr virus infection, etc.) and genetic (e.g., MHC,

IRF5 [MIM 607218], etc.) components.2 A sibling risk ratio

(λs) of approximately 30 in SLE illustrates a strong genetic

component,3 and the fact that observational studies have

identified many families with multiple cases of SLE and

other autoimmune conditions suggests the potential for

shared genetic predisposition.4–6

Candidate-gene studies and, more recently, genome-

wide association (GWA) scans have been highly successful

in identifying multiple susceptibility loci.2,7 The histocompatibility

leukocyte antigen (HLA) region has been known

to contribute to the risk of SLE and other related autoim-

mune diseases since the 1970s.8–11 In the early 2000s,

gene expression studies determined that, compared to

healthy controls, individuals with SLE overexpress genes

in the interferon pathway.12–14 Association between SLE

and variants in the region of IRF5 was first reported in

2005 and has since been replicated in most GWA scans

of SLE.15–19 In 2008, four GWA scans of SLE cases of

European descent were published, and the first GWA

scan of Asian descent was published in 2009.16,17,19–21

Collectively, these studies have identified and confirmed

~35 loci that contribute to the pathogenesis of SLE. These

data highlight the importance of several pathways,

including those involving lymphocyte activation and

function, immune-complex clearance, innate immune

response, and adaptive immune responses.2 However,

a substantial portion of the heritable risk has yet to be iden-

tified.17,22 The lack of causal variants, rare variants, and/or

other loci yet to be discovered might account for the

missing heritability.

In an effort to identify regions contributing to SLE risk,

we sought to replicate suggestive association signals in

our previously published European American SLE GWA

scan.19 We evaluated 1,580 single-nucleotide polymor-

phisms (SNPs) in an independent population of 7,998

SLE cases and 7,492 controls of European, African Amer-

ican, Asian, Hispanic, Gullah, and Amerindian ancestry

(Tables S1–S3, available online). Three loci, interferon

regulatory factor 8 (IRF8 [MIM 601565]), transmembrane

protein 39A (TMEM39A), and the region between IKAROS

family of zinc finger 3 (AIOLOS; IKZF3 [MIM 606221]) and

zona pellucida binding protein 2 (ZPBP2 [MIM 608449])

exceeded the genome-wide significance threshold (p <

5 × 10⁻⁸). Through fine mapping, resequencing, and

imputation of the IRF8 region, we identified three inde-

pendent effects required for risk. Moreover, we replicated

11 other loci, several of which had been previously re-

ported in related conditions.

Subjects and Methods

GWA Scan

Genotyping, quality control, procedures for data analysis, and

summary statistics for the GWA scan were described previously

in Graham et al., 2008.19

Study Design

The genotype data used in this study were generated as a part of

a joint effort of more than 40 investigators from around the world.

These investigators contributed samples, funding, and hypotheses

on a combined array containing ~35,000 SNPs (Figure S1). The

Oklahoma Medical Research Foundation (OMRF) served as the

coordinating center, ran the arrays, and sent the data to a central

facility for quality control at Wake Forest Medical Center. These

data were then distributed back to the investigators, who re-

quested the SNPs for final analysis and publication.23–28

Subjects

The multiracial replication study consisted of 17,003 total samples

(8,922 SLE cases and 8,077 controls) and included individuals

of self-reported African American, Asian, European, Gullah,

Hispanic, and Amerindian ancestry (Table S1). A total of 374

samples were common between the GWA scan and the replication

study so that genotypes generated by the two platforms could

be confirmed and so that genotypes of SNPs not present on the

Affymetrix 5.0 array could be obtained. These data were only

used as observed data for the imputation analysis of specific

genomic regions, as described below; to maintain independence

between the GWA scan and replication samples, we did not

include the data generated on these shared samples in the replica-

tion or fine-mapping analyses. The OMRF gathered the samples

from consenting subjects (according to the guidelines of the ethics

committees at the respective institutions where the samples were

collected) and prepared them for genotyping. All cases used in this

study fulfilled at least 4 of the 11 American College of Rheuma-

tology criteria for SLE, whereas the healthy, population-based

controls did not have any family history of SLE or any other auto-

immune disease.29

Genotyping and Sample Quality Control

A total of 1,580 SNPs that attained p < 0.05 in the previously published

GWA scan were selected for replication. In addition, 287

SNPs (chosen to capture all variation with a minimum r² threshold

of 0.8 via the TAGGER algorithm in HAPLOVIEW30) within the

IRF8 region and 347 ancestral-informative markers (AIMs) span-

ning the genome were genotyped. Genotyping of SNPs was per-

formed at OMRF with Infinium chemistry on an Illumina iSelect

custom array according to the manufacturer’s protocol. The

following quality-control procedures were implemented prior to

the analysis (Table S2): there were well-defined clusters within the

scatter plots, the SNP call rate was >90% across all samples geno-

typed, the minor allele frequency was >1%, the sample call rate

was >90%, p > 0.05 for differential missingness between cases

and controls, the total proportion missing was <5%, and the

Hardy-Weinberg proportions were p > 0.01 in controls and p >

0.0001 in cases.
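The marker-level thresholds listed above can be read as a single pass/fail predicate. The following is a minimal sketch, not the authors' pipeline: the `SnpStats` container and its field names are hypothetical, and the visual check for well-defined genotype clusters is omitted because it cannot be expressed as a numeric threshold.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Per-SNP summary statistics (hypothetical container, names are assumptions)."""
    call_rate: float         # fraction of samples with a genotype call
    maf: float               # minor allele frequency
    diff_missing_p: float    # p value for case/control differential missingness
    missing_fraction: float  # total proportion of missing genotypes
    hwe_p_controls: float    # Hardy-Weinberg p value in controls
    hwe_p_cases: float       # Hardy-Weinberg p value in cases


def passes_snp_qc(s: SnpStats) -> bool:
    """Apply the numeric marker thresholds stated in the text."""
    return (
        s.call_rate > 0.90
        and s.maf > 0.01
        and s.diff_missing_p > 0.05
        and s.missing_fraction < 0.05
        and s.hwe_p_controls > 0.01
        and s.hwe_p_cases > 0.0001
    )


# Illustrative values only: a marker with strong differential missingness fails.
good = SnpStats(0.99, 0.20, 0.40, 0.01, 0.50, 0.50)
bad = SnpStats(0.99, 0.20, 1e-7, 0.01, 0.50, 0.50)
print(passes_snp_qc(good), passes_snp_qc(bad))
```

Keeping all thresholds in one function makes it easy to see that a marker must satisfy every criterion at once; a failure on any single criterion removes it.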

Samples exhibiting excess heterozygosity (>5 standard devia-

tions [SDs] from the mean) or a <90% call rate were excluded

from the analysis. The remaining individuals were examined for

excessive allele sharing as estimated by identity-by-descent

(IBD). In sample pairs with excess relatedness (IBD > 0.4), one

individual was removed from the analysis on the basis of the

following criteria: (1) remove the sample with the lower call rate,

(2) remove the control and retain the case, (3) remove the male

sample before the female sample, (4) remove the younger control

before the older control, and (5) in a situation with two cases, re-

move the case with fewer phenotype data available. Discrepancies

between self-reported and genetically determined gender were

evaluated. Males were required to be heterozygous at rs2557523

(given that the G allele for this SNP is only observed on the

Y chromosome and the A allele appears only on the X chromosome)

and to have ≤10% chromosome X heterozygosity. Females

were required to be homozygous for the A allele at rs2557523 and

to have >10% chromosome X heterozygosity.
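The sex check described above combines one marker genotype with chromosome X heterozygosity. A minimal sketch follows, assuming genotypes are given as two-letter strings and heterozygosity as a fraction between 0 and 1; the function names and the "ambiguous" label are illustrative choices, not part of the original methods.

```python
def genetic_sex(rs2557523_genotype: str, chrx_heterozygosity: float) -> str:
    """Infer sex from the rs2557523 genotype and chromosome X heterozygosity.

    Males: heterozygous A/G (A on X, G on Y) and <=10% X heterozygosity.
    Females: homozygous A/A and >10% X heterozygosity.
    Anything else is reported as "ambiguous" (a label assumed here).
    """
    alleles = sorted(rs2557523_genotype.upper())
    if alleles == ["A", "G"] and chrx_heterozygosity <= 0.10:
        return "male"
    if alleles == ["A", "A"] and chrx_heterozygosity > 0.10:
        return "female"
    return "ambiguous"


def sex_discrepancy(self_reported: str, genotype: str, chrx_het: float) -> bool:
    """True when the genetically inferred sex does not match the self-report."""
    return genetic_sex(genotype, chrx_het) != self_reported


print(genetic_sex("AG", 0.02))               # consistent with a male sample
print(sex_discrepancy("female", "AG", 0.02))  # flagged for review
```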

Ascertainment of Population Stratification

Genetic outliers from each ethnic and/or racial group were

removed from further analysis as determined by principal-compo-

nent analysis and admixture estimates (Figure S6).31,32 Using the

163 AIMs that passed quality control in both EIGENSTRAT31 and

ADMIXMAP33,34 to distinguish the four continental ancestral pop-

ulations—Africans, Europeans, Amerindians, and East Asians—

allowed identification of the substructure within the sample

set (Figure S6A).35,36 We utilized principal components from

EIGENSTRAT outputs to identify outliers >4 SDs from the mean

of each of the first three principal components (PC) for the indi-

vidual population clusters. After quality control, a total of 1,139

samples were excluded (Figure S6B and Table S3). Overall, 2,586

subjects were included in the GWA scan, and 15,490 subjects

were included in the replication study, resulting in a total of

18,076 subjects.
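The outlier rule above (more than 4 SDs from the mean on any of the first three principal components, within one population cluster) can be sketched as follows. This is an illustration under stated assumptions: the input is a plain dictionary of sample IDs to PC coordinates rather than EIGENSTRAT output, and the sample standard deviation is used.

```python
from statistics import mean, stdev


def pc_outliers(pcs: dict, n_sd: float = 4.0) -> set:
    """Return IDs of samples more than n_sd SDs from the cluster mean on any PC.

    pcs maps sample ID -> tuple of principal-component values for samples
    that belong to a single population cluster (assumed input format).
    """
    ids = list(pcs)
    n_components = len(pcs[ids[0]])
    flagged = set()
    for k in range(n_components):
        values = [pcs[i][k] for i in ids]
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue  # no spread on this component, nothing to flag
        for sample_id, value in zip(ids, values):
            if abs(value - mu) > n_sd * sd:
                flagged.add(sample_id)
    return flagged


# Toy cluster: 29 tightly grouped samples plus one sample far away on PC1.
cluster = {f"s{i}": ((0.01 if i % 2 == 0 else -0.01),) * 3 for i in range(29)}
cluster["far"] = (1.0, 0.0, 0.0)
print(pc_outliers(cluster))
```

Note that with very small clusters no sample can exceed 4 SDs, because the largest attainable deviation is bounded by the cluster size; the rule is only meaningful for reasonably large groups.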

Statistical Analysis

Testing for SNP-SLE association in the replication study was carried

out through the computation of logistic regression as imple-

mented in PLINK v.1.07.37 The calculation of the additive genetic

model included adjustments for the first three PCs and gender.

Models were also adjusted for ancestry with estimates provided

by ADMIXMAP and resulted in no observable difference in association

as compared with PC adjustment. We conducted conditional

likelihood-ratio tests by using the extended WHAP functionality

in PLINK v.1.07. The genome-wide p value threshold for all data

replicating GWA-scan results was p < 5 × 10⁻⁸ after meta-analysis.

For the fine mapping and imputation of IRF8, we utilized a Bonferroni-corrected

p value threshold of p < 1.09 × 10⁻⁴ on the basis of

the maximum number of tests across all populations (460 independent

variants with r² < 0.8). Using METAL,38 we performed

meta-analyses of the SNPs observed in both the GWA scan and

the multiracial replication study with a weighted Z score. Each

racial group was weighted by the square root of its sample size

so that sample-size differences could be controlled for between

studies. We combined all data generated unless the variant failed

quality control in a given racial group.
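Two calculations in this paragraph are easy to reproduce: the Bonferroni threshold (0.05 divided by 460 tests) and a sample-size-weighted Z-score meta-analysis. The sketch below is not METAL itself; it assumes the standard weighted-Z scheme in which each two-sided p value is converted to a signed Z score and weighted by the square root of the sample size. The function names are hypothetical, and using total subjects per data set as the sample size is an assumption.

```python
from math import erfc, sqrt
from statistics import NormalDist


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold for n_tests independent tests."""
    return alpha / n_tests


def weighted_z_meta(studies):
    """Combine studies given as (two_sided_p, effect_sign, sample_size).

    effect_sign is +1 for OR > 1 and -1 for OR < 1.
    Returns (combined Z, combined two-sided p).
    """
    numerator = 0.0
    sum_sq_weights = 0.0
    for p, sign, n in studies:
        z = -NormalDist().inv_cdf(p / 2.0)  # magnitude of the Z score
        z *= 1.0 if sign >= 0 else -1.0     # restore direction of effect
        w = sqrt(n)                         # weight: square root of sample size
        numerator += w * z
        sum_sq_weights += w * w
    z_meta = numerator / sqrt(sum_sq_weights)
    p_meta = erfc(abs(z_meta) / sqrt(2.0))  # two-sided normal tail probability
    return z_meta, p_meta


print(bonferroni_threshold(0.05, 460))  # about 1.09e-4

# rs11644034 in Europeans, using p values from Table 1 and subject counts from
# the text (2,586 in the GWA scan; 3,562 + 3,491 in the European replication).
z, p = weighted_z_meta([(2.29e-3, -1, 2586), (2.36e-8, -1, 3562 + 3491)])
print(z, p)
```

Under these assumptions the combined p value for rs11644034 comes out on the order of 10⁻¹⁰, the same order as the reported European meta-analysis value.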

To test for meta-analysis heterogeneity, we utilized both Cochran’s

Q test and the I² index. Cochran’s Q test is a classical

method that calculates the weighted sum of the squared devia-

tions between individual study effects and the overall effect across

studies.39 It follows a chi-square distribution with k-1 degrees of

freedom (k is the number of studies). A value of p < 0.05 was

considered significant evidence of heterogeneity. The I² index

measures the degree or percentage of inconsistency—due to

heterogeneity rather than random chance—across studies.40 The

I² index ranges from 0% to 100% and is interpreted as 0%–25% (low

heterogeneity), 26%–50% (moderate heterogeneity), 51%–75%

(high heterogeneity), or 76%–100% (very high heterogeneity).
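Both heterogeneity measures can be computed from per-study effect estimates. The sketch below assumes inverse-variance weights (1/SE²) and the usual definition I² = max(0, (Q − df)/Q) × 100%; the text does not state which weights were used, so this is an illustration rather than a reproduction. The p value for Q, taken from a chi-square distribution with k − 1 degrees of freedom, is left out to keep the example dependency-free.

```python
def cochran_q_i2(effects, std_errors):
    """Return (Q, degrees of freedom, I2 in percent).

    effects: per-study effect estimates (for example, log odds ratios).
    std_errors: matching standard errors; weights are 1 / SE**2 (assumption).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, effects)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


# Two studies with unit standard errors and effects 0 and 2:
# pooled effect 1, Q = 2, df = 1, I2 = 50%.
print(cochran_q_i2([0.0, 2.0], [1.0, 1.0]))
```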

Linkage disequilibrium (LD) and probable haplotypes were

determined with HAPLOVIEW v.4.2.30 We calculated haplotype

blocks for those haplotypes present at >3% frequency by using

the solid spine of LD algorithm with minimum r² values of

0.8.30

Resequencing

We resequenced the IRF8 region (chromosome 16: 84,488,150–

84,539,352 bp) in 206 (92 SLE cases and 114 healthy controls)

European and 46 (25 SLE cases and 21 healthy controls) African

American subjects. For each sample, 3–5 μg of whole genomic

DNA were sheared and prepared for sequencing with an Illumina

Paired-End Genomic DNA Sample Prep Kit. Targeted regions of

interest from each sample were then enriched with a SureSelect

Target Enrichment System utilizing a custom-designed bait pool

(Agilent Technologies). Resequencing was undertaken with an

Illumina GAIIx platform according to standard procedures. Post-

sequence data were processed with Illumina’s Pipeline software

v.1.7. All samples were sequenced to minimum average fold

coverage of 25×.

Variant Detection and Quality Control

Unique sequences corresponding to an individual nucleotide

molecule were aligned to the human genome reference (hg18)

with the Burrows-Wheeler Aligner (BWA)41 alignment tool. Reads

were locally realigned around known and suspected insertion and

deletion sites with the Genome Analysis Toolkit (GATK) analysis

suite so that the best possible read alignment could be gener-

ated.42 Recalculation of the correct quality score for each base

within the alignment was then performed empirically with the

GATK suite. This process served to correct overestimated high-

quality scores initially reported by the sequencer itself.

After local realignment around deletion-insertion polymor-

phism (DIP) sites and base quality-score recalibration, SNP and

DIP genotypes were generated for each sample individually as

well as for the samples as a whole. Finally, SNP and DIP genotypes

were hard-filtered against a set of criteria designed to remove any

remaining low-quality calls. For a variant to be included in the call

list, we required a Phred quality score >30, a quality-by-depth ratio

of >5.0, a strand bias score of < −0.10, and a homopolymer

run of <5 bases. The variant phase was determined with the

program BEAGLE.43 Variants meeting call parameters were output

to files compatible with PLINK and other genotyping tools via the

VCFtools analysis suite.
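The hard-filter criteria above amount to four simultaneous conditions on each variant call. A minimal sketch, assuming the four annotations have already been extracted into plain numbers (the parameter names are illustrative and do not correspond to specific VCF field names):

```python
def passes_hard_filter(phred_quality: float,
                       quality_by_depth: float,
                       strand_bias: float,
                       homopolymer_run: int) -> bool:
    """Keep a variant call only if it meets all four stated criteria."""
    return (
        phred_quality > 30
        and quality_by_depth > 5.0
        and strand_bias < -0.10
        and homopolymer_run < 5
    )


# Illustrative values only.
print(passes_hard_filter(55.0, 12.3, -0.85, 2))  # kept
print(passes_hard_filter(55.0, 12.3, 0.02, 2))   # removed: strand bias too high
```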

To assess the accuracy of sequence-based SNP calling, we cross-

referenced the sequenced and genotyped allele calls. We observed

~99% concordance between genotypes and sequence-based variant

detection, suggesting high-quality sequence data. We manually

inspected the samples with ≥5% of variants differing between

sequencing and genotyping to determine where sequence quality

was poor. As an additional quality-control measure, we confirmed

each variant identified by our automated workflow by manual

inspection of the assembled contig by using the Integrative

Genomics Viewer (IGV) program.44

Imputation

To increase the informativeness of the IRF8 region, we conducted

imputation in subjects of European, African American, and Asian

ancestry over a 100 kb interval spanning the IRF8 locus. Imputa-

tion of the replication data across chromosome 16 (84.45–84.46

Mb) was performed with IMPUTE2 and the reference panels

provided in Table S7.45–47 Imputed genotypes were required to

meet or exceed a probability threshold of 0.8, an information

measure of >0.4, and the same quality-control-criteria thresholds

described above for inclusion in the analyses.
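The two imputation thresholds act at different levels: the probability threshold applies to each imputed genotype, and the information measure applies to each variant. The sketch below assumes genotype probabilities are given as a triple (P(AA), P(AB), P(BB)); the function names are hypothetical and no IMPUTE2 file parsing is attempted.

```python
def best_guess_genotype(probabilities, threshold: float = 0.8):
    """Return the index (0, 1, or 2) of the most probable genotype,
    or None when its probability does not meet or exceed the threshold."""
    best = max(range(3), key=lambda g: probabilities[g])
    return best if probabilities[best] >= threshold else None


def keep_imputed_variant(info: float, min_info: float = 0.4) -> bool:
    """Variant-level filter on the imputation information measure."""
    return info > min_info


print(best_guess_genotype((0.05, 0.90, 0.05)))  # heterozygote call retained
print(best_guess_genotype((0.50, 0.45, 0.05)))  # set to missing (None)
print(keep_imputed_variant(0.62))
```

Genotypes set to missing here would then count toward the call-rate and missingness thresholds described earlier in the methods.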

Results

Two SNPs, rs11648084 and rs11644034, telomeric to IRF8

at 16q24.1 were suggestive of association with SLE in

our published GWA scan (p = 5.99 × 10⁻⁴ and 2.29 × 10⁻³,

respectively; odds ratio [OR] = 0.76 and 0.66,

respectively; Table 1). Both rs11648084 and rs11644034

were replicated in the current, independent population

of SLE cases and controls of European ancestry and

exceeded the genome-wide threshold (p_meta-Euro = 2.34 × 10⁻⁹

and p_meta-Euro = 2.08 × 10⁻¹⁰, respectively; Table 1).

However, neither SNP was significantly associated with

SLE in any other population studied, perhaps as a result

of the reduced sample size, clinical and/or genetic

heterogeneity, decreased minor allele frequency, and/or

reduced correlation with the causal variants (Table 1 and

Table S4).

To better refine the association signal, 287 additional

SNPs covering ~100 kb encompassing the IRF8 coding

region were genotyped (see Subjects and Methods; Figures

1A and 1D, Table 2, and Table S4). The most significant

association in the European population was with

rs9936079 (p = 3.96 × 10⁻⁹, OR = 0.77; Figures 1A and

1D and Table 2), located ~11 kb telomeric to IRF8 and

found to be in strong LD with rs11644034 (r² = 0.92;

Figures 2B and 2C). The Asian population also exhibited

association with rs9936079 (p = 2.95 × 10⁻³, OR = 0.73);

however, rs9936079 failed to pass quality-control

measures (see Subjects and Methods; differential missingness

p = 10⁻⁷) in individuals of African ancestry and

was not associated with disease in patients of Hispanic

or Amerindian ancestry. Meta-analysis yielded

p_meta-all = 9.28 × 10⁻¹¹ (Table 2 and Table S4).

We observed a modest association in the African Americans

at rs2934498 (p = 3.92 × 10⁻⁴, OR = 0.83), which was

also significant in those individuals of European ancestry

(p = 5.96 × 10⁻⁶, OR = 1.19) but not in those of Asian

ancestry (Figures 1B and 1D, Table 2, and Table S4). The strongest

Asian association was observed in a region (rs11117427,

p = 1.99 × 10⁻⁵, OR = 0.64) ~34 kb telomeric to IRF8

(Figures 1C and 1D, Table 2, and Table S4). Association

was also observed with rs11117427 (p = 3.46 × 10⁻⁴,

OR = 0.84) in the Europeans (Figures 1A and 1D and Table

2). Interestingly, this SNP is only ~2 kb away from

rs12444486, which has been reported by Gateva et al.22

as being suggestive of association with SLE in Europeans

(Figure 1D and Table S4).

We resequenced the IRF8 region in 206 subjects of

European ancestry and in 46 subjects of African American

ancestry to identify variants not previously evaluated

within the IRF8 region and to assess their association

with SLE (see Subjects and Methods). Thirty-eight and

85 variants not present in dbSNP 130 were identified in

European and African American individuals, respectively.

After imputing these data into our larger European and

African American datasets (see Subjects and Methods),

the most significantly associated region within the

European population was ~19 kb telomeric to IRF8

(rs4843869, p = 7.61 × 10⁻¹⁰, OR = 0.76; Table 2 and

Figures 1A and 1D). Ultimately, three strongly correlated

(r² > 0.90) SNPs emerged as the most significantly associated

with SLE in the Europeans: rs11644034 (identified

via GWA scan), rs9936079 (identified by fine mapping),

and rs4843869 (imputed on the basis of resequencing).

Interestingly, our targeted resequencing revealed a DIP

(rs11347703, p = 1.11 × 10⁻⁸, OR = 0.78) that was located

less than 100 bp from a genotyped SNP, rs8052690

(p = 5.69 × 10⁻⁸, OR = 0.79; Table 2). This DIP,

which has high biological plausibility, was in strong LD

with the peak European SNP (rs4843869) and rs8052690

(r²/D′ > 0.9; Figures 2B and 2C). Of note, some of

the African Americans resequenced in this study did

harbor the DIP (rs11347703) identified in the Europeans.

However, neither rs11347703 nor any SNP correlated

with it was found to be significantly associated with

SLE in African Americans, which could be due to the

decrease in power and/or a decrease in the minor allele

frequency.

The peak association in African Americans after imputation

was at rs450443 (p = 1.41 × 10⁻⁴, OR = 0.82; Table 2

and Figures 1B and 1D) and was in strong LD (r² = 0.88)

with rs2934498. Patients of European but not Asian

ancestry showed association with rs450443 (p = 9.73 × 10⁻⁶,

OR = 1.18; Table 2 and Figures 1A, 1C, and 1D).

We conducted imputation in the Asian population by

using 1000 Genomes phased haplotypes;48 however,

rs11117427 remained the peak signal (Table 2 and Figures

1C and 1D).

To assess the independence of variants in the European

population, we used logistic regression models that

adjusted for the best tagging SNP at each signal. When

we adjusted for rs4843869 in Europeans, the association

persisted at rs450443 and at the variants correlated with it.

However, adjusting for rs4843869 negated the association

with rs11117427 and its correlated variants (Figures 2B,

2C, and 3, Table S5, and Figure S2A). Adjusting for either

rs450443 or rs11117427 was only able to negate the asso-

ciations of the polymorphisms that were correlated with

each of these SNPs (Figures 2B, 2C, and 3, Table S5, and

Table 1. SLE Risk Loci Surpassing the Genome-wide Significance Threshold (a)

Replication sample sizes: European, 3,562 cases/3,491 controls; African American, 1,527 cases/1,811 controls; Asian, 1,265 cases/1,260 controls.

Chr | SNP | Locus | Alleles (b) | European p GWA scan (c) | European OR GWA scan (95% CI) | European p REP | European OR REP (95% CI) | p META-Euro | African American p | African American OR (95% CI) | Asian p | Asian OR (95% CI) | p META-ALL (d) | pQ (e) | I²

3 | rs1132200 | TMEM39A | G/A | 1.65 × 10⁻³ | 0.72 (0.59–0.88) | 2.37 × 10⁻⁴ | 0.83 (0.76–0.92) | 1.81 × 10⁻⁶ | 6.92 × 10⁻² | 0.75 (0.56–1.02) | 1.66 × 10⁻³ | 0.73 (0.59–0.89) | 8.62 × 10⁻⁹ | 0.450 | 0.0%

16 | rs11644034 | IRF8 | G/A | 2.29 × 10⁻³ | 0.66 (0.54–0.79) | 2.36 × 10⁻⁸ | 0.78 (0.71–0.85) | 2.08 × 10⁻¹⁰ | 5.10 × 10⁻¹ | 0.95 (0.81–1.11) | 2.63 × 10⁻² | 0.79 (0.65–0.97) | 2.72 × 10⁻⁹ | 0.016 | 61.5%

16 | rs11648084 | IRF8 | G/A | 5.99 × 10⁻⁴ | 0.76 (0.65–0.89) | 9.35 × 10⁻⁷ | 0.83 (0.77–0.89) | 2.34 × 10⁻⁹ | 6.33 × 10⁻² | 0.90 (0.81–1.01) | 7.52 × 10⁻¹ | 1.02 (0.91–1.14) | 7.00 × 10⁻⁷ | 0.001 | 74.0%

17 | rs9913957 | IKZF3 | A/G | 7.87 × 10⁻³ | 1.75 (1.27–2.41) | 5.14 × 10⁻⁴ | 1.38 (1.15–1.66) | 1.38 × 10⁻⁵ | 1.07 × 10⁻² | 1.22 (1.05–1.41) | – | – | 1.39 × 10⁻⁸ | 0.105 | 45.1%

17 | rs8076347 | IKZF3 | C/A | 3.07 × 10⁻³ | 1.93 (1.41–2.62) | 3.04 × 10⁻³ | 1.32 (1.10–1.58) | 4.75 × 10⁻⁵ | 2.19 × 10⁻³ | 1.20 (1.07–1.34) | – | – | 3.01 × 10⁻⁸ | 0.047 | 55.4%

17 | rs8079075 | IKZF3 | A/G | 1.47 × 10⁻³ | 1.90 (1.39–2.59) | 5.08 × 10⁻⁴ | 1.39 (1.16–1.68) | 3.81 × 10⁻⁶ | 2.62 × 10⁻³ | 1.26 (1.08–1.46) | – | – | 4.83 × 10⁻⁹ | 0.201 | 31.3%

17 | rs1453560 | ZPBP2 | A/C | 7.81 × 10⁻⁴ | 1.92 (1.41–2.61) | 6.42 × 10⁻⁴ | 1.37 (1.14–1.64) | 3.21 × 10⁻⁶ | 4.86 × 10⁻⁴ | 1.23 (1.09–1.37) | – | – | 3.48 × 10⁻¹⁰ | 0.097 | 46.4%

The following abbreviations are used: Chr, chromosome; OR, odds ratio; GWA, genome-wide association; REP, replication; and CI, confidence interval.

(a) Table S7 contains results for all populations evaluated within this study.

(b) Major/minor alleles.

(c) Result of GWA scan was previously reported in Graham et al.1

(d) Data were combined for all racial groups genotyped within our study that passed quality control.

(e) Cochran’s Q test statistic.

Figures S2B and S2C). A SNP (rs8046526) in the sixth

intron of IRF8 was also associated with SLE risk in

Europeans (p = 3.96 × 10⁻⁶, OR = 0.80) and remained

significant after adjusting for the other SNPs (Table 2,

Figures 1A, 1D, 2B, 2C, and 3, Table S5, and Figure S2D).

Figure 1. Variants in the Region of IRF8 Tested for Association with SLE

The association between IRF8 and SLE in European (A), African American (B), and Asian (C) ancestral populations is given with observed (blue diamonds) and imputed (red circles) variants. The dotted line represents the Bonferroni-corrected threshold (p = 1.09 × 10⁻⁴) for the fine-mapping study. The solid black line represents the recombination rate. The variants labeled with blue text represent the most significant observed SNPs, whereas the variants labeled in red represent the most significant SNPs after imputation. In the Asians, rs11117427 was both the most significant observed and imputed variant.

(D) Shown is an expanded view of the most statistically significant region in Europeans (circles), African Americans (triangles), and Asians (squares) for selected variants tagged by rs8046526 (purple), rs450443 (turquoise), rs4843869 (yellow-orange), and rs11117427 (green). The following abbreviation is used: Recomb., recombination.

Adjusting for rs8046526 in the Euro-

peans only negated associations for

itself and its correlated variants.

However, adjusting the logistic regres-

sion model for rs8046526, rs450443,

and rs4843869 negated all associa-

tions present in the European popula-

tion (Figures 2B, 2C, and 3, Table S5,

and Figure S2E), demonstrating the

importance of these three IRF8 vari-

ants for SLE risk.

Haplotype analysis identified a

single risk haplotype (H2) (p = 6.42 × 10⁻⁸)

with a frequency of 18.4% in

the European individuals (Figure 2).

Two significant protective haplo-

types, H6 and H7, were also identified

(Figure 2). The risk-associated alleles

within the region bounded by SNPs

rs11117426–rs34912238 (the peak

Asian effect) were also present in the

most significant protective haplo-

type, H7, suggesting that this region

might not impact disease risk in

Europeans (Figure 2). The only differ-

ences between H3 and H6 as well as

between H4 and H7 are rs8046526

and rs8058904 in the minor form, suggesting that these

SNPs are important in conferring protection from disease

(Figure 2). The only differences between H2 and H5 (which

are not statistically significant) are the major alleles (for

SNPs rs8046526 and rs8058904) residing on the H2

Table 2. IRF8 Variants Associated with SLE (a)

SNP | Genotyped or Imputed | Position (bp) | Alleles (b) | MAF (c) | European p | European OR (95% CI) | African American p | Asian p

rs8046526 | I | 84,509,136 | C/T | 0.14/0.16 | 3.96 × 10⁻⁶ | 0.80 (0.73–0.88) | – | –

rs8058904 | G | 84,509,183 | A/G | 0.14/0.16 | 5.14 × 10⁻⁶ | 0.80 (0.73–0.88) | 1.96 × 10⁻¹ | –

rs9936079 | G | 84,525,095 | G/A | 0.17/0.22 | 3.96 × 10⁻⁹ | 0.77 (0.70–0.84) | – | 2.95 × 10⁻³

rs385344 | I | 84,525,105 | C/G | 0.30/0.27 | 1.37 × 10⁻⁵ | 1.18 (1.10–1.27) | – | 2.43 × 10⁻¹

rs34337659 | I | 84,525,158 | T/C | 0.31/0.27 | 1.55 × 10⁻⁵ | 1.18 (1.10–1.27) | 3.97 × 10⁻⁴ | 1.14 × 10⁻¹

rs66509440 | I | 84,525,182 | C/T | 0.28/0.25 | 6.36 × 10⁻⁶ | 1.20 (1.11–1.30) | 3.97 × 10⁻⁴ | –

rs66804793 | I | 84,525,190 | G/A | 0.28/0.25 | 6.16 × 10⁻⁶ | 1.20 (1.11–1.30) | 3.84 × 10⁻⁴ | –

rs74032085 | I | 84,525,245 | T/C | 0.28/0.24 | 2.25 × 10⁻⁶ | 1.22 (1.12–1.32) | 4.72 × 10⁻⁴ | –

rs16940044 | I | 84,525,266 | A/G | 0.27/0.24 | 2.16 × 10⁻⁶ | 1.22 (1.12–1.32) | 4.73 × 10⁻⁴ | –

rs2934497 | I | 84,525,379 | C/T | 0.27/0.23 | 2.62 × 10⁻⁷ | 1.24 (1.14–1.35) | 5.12 × 10⁻⁴ | 1.95 × 10⁻¹

rs2970091 | I | 84,525,387 | G/A | 0.28/0.24 | 2.96 × 10⁻⁷ | 1.24 (1.14–1.34) | 5.31 × 10⁻⁴ | 1.93 × 10⁻¹

rs2934498 | G | 84,525,783 | A/G | 0.31/0.27 | 5.96 × 10⁻⁶ | 1.19 (1.11–1.29) | 3.92 × 10⁻⁴ | 2.09 × 10⁻¹

rs439885 | G | 84,526,175 | G/A | 0.31/0.27 | 1.16 × 10⁻⁵ | 1.19 (1.10–1.28) | 5.59 × 10⁻⁴ | 1.98 × 10⁻¹

rs450443 | I | 84,526,392 | T/G | 0.30/0.27 | 9.73 × 10⁻⁶ | 1.18 (1.10–1.28) | 1.41 × 10⁻⁴ | 2.04 × 10⁻¹

rs396987 | I | 84,526,435 | A/G | 0.30/0.27 | 8.89 × 10⁻⁶ | 1.19 (1.10–1.28) | 5.44 × 10⁻⁴ | 2.04 × 10⁻¹

rs4843865 | G | 84,526,806 | T/A | 0.17/0.21 | 2.93 × 10⁻⁸ | 0.78 (0.72–0.85) | 6.64 × 10⁻¹ | 1.46 × 10⁻²

rs11347703 | I | 84,527,141 | G/– | 0.18/0.21 | 1.11 × 10⁻⁸ | 0.78 (0.72–0.85) | 5.73 × 10⁻¹ | –

rs8052690 | G | 84,527,239 | A/G | 0.18/0.21 | 5.69 × 10⁻⁸ | 0.79 (0.72–0.86) | 6.58 × 10⁻¹ | 5.28 × 10⁻³

rs186249 | G | 84,528,397 | G/C | 0.30/0.26 | 1.66 × 10⁻⁵ | 1.19 (1.10–1.28) | 1.63 × 10⁻² | 9.37 × 10⁻¹

rs11117422 | G | 84,529,514 | G/C | 0.17/0.21 | 9.37 × 10⁻⁹ | 0.77 (0.71–0.84) | 6.76 × 10⁻¹ | 1.13 × 10⁻²

rs11644034 | G | 84,530,113 | G/A | 0.17/0.20 | 2.36 × 10⁻⁸ | 0.78 (0.71–0.85) | 5.10 × 10⁻¹ | 2.63 × 10⁻²

rs305066 | I | 84,530,277 | C/T | 0.33/0.29 | 8.18 × 10⁻⁶ | 1.18 (1.10–1.27) | – | 3.62 × 10⁻¹

rs13335265 | G | 84,530,311 | C/G | 0.16/0.20 | 1.23 × 10⁻⁸ | 0.77 (0.70–0.84) | 2.86 × 10⁻¹ | 1.06 × 10⁻²

rs12711490 | G | 84,530,529 | A/G | 0.17/0.20 | 2.11 × 10⁻⁸ | 0.78 (0.71–0.85) | 4.51 × 10⁻¹ | 7.55 × 10⁻²

rs11641153 | I | 84,530,641 | A/G | 0.16/0.20 | 1.31 × 10⁻⁹ | 0.76 (0.70–0.83) | 4.85 × 10⁻¹ | 7.02 × 10⁻²

rs11641155 | I | 84,530,653 | A/G | 0.16/0.20 | 1.23 × 10⁻⁹ | 0.76 (0.70–0.83) | – | 7.02 × 10⁻²

rs7205434 | I | 84,530,696 | C/G | 0.16/0.20 | 1.23 × 10⁻⁹ | 0.76 (0.70–0.83) | 4.85 × 10⁻¹ | 7.02 × 10⁻²

rs4843868 | I | 84,530,902 | C/T | 0.16/0.20 | 8.57 × 10⁻¹⁰ | 0.76 (0.70–0.83) | 5.82 × 10⁻¹ | 7.02 × 10⁻²

rs305063 | G | 84,532,158 | C/A | 0.32/0.29 | 7.34 × 10⁻⁵ | 1.17 (1.08–1.26) | – | 9.66 × 10⁻¹

rs4843323 | I | 84,532,462 | C/T | 0.16/0.20 | 7.71 × 10⁻¹⁰ | 0.76 (0.70–0.83) | – | –

rs4843869 | I | 84,532,642 | G/A | 0.16/0.20 | 7.61 × 10⁻¹⁰ | 0.76 (0.70–0.83) | 4.19 × 10⁻¹ | 6.59 × 10⁻²

rs7202472 | G | 84,535,003 | C/A | 0.15/0.19 | 7.25 × 10⁻⁹ | 0.77 (0.70–0.84) | 3.94 × 10⁻¹ | 1.41 × 10⁻²

rs11117426 | G | 84,547,768 | A/G | 0.16/0.19 | 2.42 × 10⁻⁴ | 0.84 (0.77–0.92) | 1.07 × 10⁻¹ | 2.12 × 10⁻⁵

rs11117427 | G | 84,548,058 | G/A | 0.16/0.18 | 3.46 × 10⁻⁴ | 0.84 (0.77–0.93) | – | 1.99 × 10⁻⁵

rs12445476 | G | 84,548,770 | A/C | 0.16/0.18 | 1.76 × 10⁻⁴ | 0.84 (0.76–0.92) | 6.49 × 10⁻² | 2.19 × 10⁻⁵

rs11642873 | G | 84,549,206 | A/C | 0.15/0.18 | 2.92 × 10⁻⁴ | 0.84 (0.77–0.92) | 8.82 × 10⁻¹ | 5.63 × 10⁻⁵

rs34912238 | I | 84,559,404 | C/T | 0.16/0.19 | 2.15 × 10⁻⁵ | 0.82 (0.75–0.90) | – | –

The following abbreviations are used: G, genotyped; I, imputed; MAF, minor-allele frequency; OR, odds ratio; and CI, confidence interval.

(a) All subjects, including the 374 that were removed so that the replication study was independent from the GWA scan, were imputed. Tables S4 and S5 contain results for all populations evaluated within this study.

(b) Major/minor alleles.

(c) Case/control.

haplotype and the minor alleles on the neutral H5 haplo-

type. Thus, it appears that all three regions (tagged by

rs8046526, rs450443, and rs4843869) are required for

risk. Many variants residing on the risk haplotype are

within regions known to bind multiple transcription

factors in the ENCODE ChIP-Seq project dataset in immu-

nologic cell types (Figures S3 and S4).49 Thus, we hypoth-

esize that the risk haplotype has the potential to affect

the regulation of IRF8 expression and/or the expression

of other genes in the region.

Within TMEM39A in region 3q13.33, a coding SNP

(rs1132200) that demonstrated suggestive evidence of

association in our previous GWA scan (p = 1.65 × 10⁻³)

was also confirmed in the European replication study

(p = 2.37 × 10⁻⁴, OR = 0.83; Table 1 and Table S6). This

nonsynonymous SNP showed association with SLE in Asian

patients (p = 1.66 × 10⁻³, OR = 0.73) but not in African

Americans, Hispanics, Gullah, or Amerindians (Table 1

and Table S6). When analyzing this SNP in all populations

that passed quality control, a meta-analysis produced

p_meta-all = 8.62 × 10⁻⁹, and no evidence of heterogeneity

was observed between these datasets (Table 1 and Table S6).

Finally, we replicated several SNPs in the 17q12 region

between IKZF3 and ZPBP2 (Table 1 and Table S6). Three

SNPs within IKZF3 replicated, and rs8079075 was the

most significant SNP in both the samples of European

(p = 5.08 × 10⁻⁴, OR = 1.39) and African American

(p = 2.62 × 10⁻³, OR = 1.26) ancestry (p_meta-all = 4.83 × 10⁻⁹).

The most significant SNP in this region (rs1453560) is

located between IKZF3 and ZPBP2 and was replicated in

European (p = 6.42 × 10⁻⁴, OR = 1.37) and African

American (p = 4.86 × 10⁻⁴, OR = 1.23) ancestral populations;

the replication resulted in p_meta-all = 3.48 × 10⁻¹⁰

(Table 1 and Table S6). All four SNPs are highly correlated

(r² > 0.95). Even though Cochran’s Q test of heterogeneity

was not statistically significant, we observed moderate

heterogeneity by the I² index, perhaps as a result of the

differences in allele frequency between the racial groups

(Table 1). IKZF3 and ZPBP2 are transcribed in opposite

directions of one another but share the same promoter

region (Figure S5). The ENCODE ChIP-Seq project has

identified multiple transcription-factor binding sites for

chromatin in the chromosomal region surrounding

rs1453560 (Figure S5).49

In addition to the three regions (described above) that

now exceed genome-wide significance, 11 loci were repli-

cated in the European SLE cases but did not exceed

genome-wide significance (5 × 10⁻⁸ < p_meta-Euro < 9.99 × 10⁻⁵).

These 11 loci include the following: CFHR1 (MIM

134371), CADM2, LOC730109/IL12A (MIM 161560),

LPP (MIM 600700), LOC63920, SLU7 (MIM 605974),

ADAMTSL1 (MIM 609198), C10orf64, OR8D4, FAM19A2,

and STXBP6 (MIM 607958) (Table 3 and Table S7).

Discussion

The interferon regulatory factors are a family of transcrip-

tion factors that play a critical role in the regulation of

several pathways, including the response to pathogens,

apoptosis, the cell cycle, and hematopoietic differentia-

tion.50 IRF8 is expressed in the nucleus (but partially

in the cytoplasm) of B cells, macrophages, and CD11b

dendritic cells (DCs).50 IRF8 can be induced by interferon-γ

in macrophages and antigen stimulation within

T cells. It also plays an important role in the development

of B cells and macrophages.50 In the nucleus, IRF8 is

required for promoting type I interferon responses in

DCs upon viral stimulation.50 Interestingly, the overex-

pression of genes induced by type I interferons has been

widely reported in SLE and other autoimmune condi-

tions.12,51,52 In the cytosol, IRF8 is involved in the TLR9-

MyD88-dependent signaling by binding to TRAF6 in

both DCs and macrophages.50 After TLR9 stimulation,

DCs from mice that are Irf8−/− cannot activate NF-κB or

MAPKs.50 Of note, rs17445836, which was not included

Figure 2. Conditional-Analysis Results Conducted in Individuals of European Ancestry

The results of the conditional analysis with the four SNPs show peak association in Europeans (rs8046526 and rs4843869), African Americans (rs450443), and Asians (rs11117427). The black dot represents the unadjusted single-marker association with SLE.


in our study but has been associated with multiple sclerosis

(MIM 126200), lies approximately 61 kb telomeric to IRF8

and is far removed from the regions identified in SLE.53

Fine mapping, resequencing, imputation, and haplotype

analysis of the IRF8 locus in Europeans identified a single

haplotype requiring the presence of three independent

effects to confer risk. Additionally, several variants within

the IRF8 risk haplotype might influence binding to the

many regulatory elements present within the region.

Thus, we hypothesize that the likely functional effect

would result in altered IRF8 mRNA and protein expression.

Although we believe that most of the common variation

within the IRF8 region has been evaluated in this

study, it is possible that some variants with minor allele

frequencies <1% also play a role in SLE risk but were not

detected in our study because of the number of samples

resequenced.
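The contrast drawn in Figure 3, weak r^2 alongside strong D′, arises whenever two variants differ markedly in allele frequency yet rarely recombine. The sketch below illustrates the arithmetic with made-up haplotype frequencies; the function name and the numbers are assumptions for illustration and are not taken from the study data.

```python
def ld_stats(p_a, p_b, p_ab):
    """Return (D', r^2) for two biallelic markers.

    p_a, p_b: frequencies of allele A at marker 1 and allele B at marker 2
    p_ab:     frequency of the haplotype carrying both A and B
    Assumes both markers are polymorphic (0 < p_a < 1 and 0 < p_b < 1).
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


# Hypothetical case: allele B (10%) is found only on haplotypes that
# also carry allele A (50%), so no A-absent/B-present haplotype exists.
d_prime, r2 = ld_stats(p_a=0.5, p_b=0.1, p_ab=0.1)
print(f"D'={d_prime:.2f}  r^2={r2:.3f}")
```

Here D′ reaches its maximum because one of the four possible haplotypes is absent, while r^2 stays low because the allele frequencies are unequal, so neither variant predicts the other well.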

The TMEM39A-associated coding SNP (rs1132200)

results in an amino acid change from alanine to threonine

at position 487 of the protein. Although almost no

biological data have been published suggesting its rele-

vance to SLE, it has been found to be associated with

multiple sclerosis.54 Better understanding whether the

coding SNP in TMEM39A is functionally relevant or is

merely correlated with other unexamined causal polymor-

phism(s) will require mechanistic and fine-mapping

experiments.

Although the region surrounding the IKZF3-ZPBP2

locus at 17q21 has been associated with multiple pheno-

types, the extensive LD in the region has prohibited inves-

tigators from clearly determining the relevant gene. Crohn

disease (MIM 266600), ulcerative colitis (MIM 266600),

primary biliary cirrhosis (MIM 109720), and rheumatoid

arthritis (MIM 180300) all have reported associations

Figure 3. IRF8 Haplotype and LD in the European Ancestral Population

(A) Haplotype structure in Europeans present at a frequency >3%. Major alleles are represented by red squares, whereas the green squares are minor alleles.

(B and C) A LD plot of r^2 (B) and D′ (C) in Europeans illustrates that the variants tagged by rs450443 and those tagged by rs4843869 are in weak r^2 but strong D′, providing evidence that these variants are inherited together.


with genes between 34.62 and 35.51 Mb on chromosome

17.55–60 Fine mapping and resequencing of this region in

Europeans and African Americans are needed if researchers

are to more precisely refine this association and determine

the loci associated with risk. IKZF3 is a member of the

IKAROS family of transcription factors involved in

lymphocyte development; IKZF1 in this family has already

been reported as a risk locus for SLE.22 Mice with a mutant

form of IKZF3 produce anti-dsDNA autoantibodies,

making it an interesting candidate gene for human

SLE.61 Moreover, mice that are null for IKZF3 and OBF-1

(POU class 2 associating factor 1) do not mount an autoim-

mune response.61 The peak signal in our study was in

a region containing multiple regulatory elements, so it is

likely that the associated SNP could affect expression of

IKZF3 or ZPBP2, which both share the promoter region.

However, no known function of ZPBP2 has been reported.

Eleven additional regions were replicated in the European

subjects but did not surpass genome-wide significance.

Of these regions, LOC730109/IL12A was previously

reported as a risk locus for primary biliary cirrhosis and

multiple sclerosis.60,62 IL-12A induces interferon-gamma

and helps differentiate Th1 and Th2 cells.63 The response

of lymphocytes to IL-12A is mediated by STAT4, which is

also implicated in SLE pathogenesis.64 The LIM domain

containing preferred translocation partner in lipoma

(LPP) is involved in focal adhesions, cell-cell adhesion,

and cell motility. Variants within the LPP region have

been associated with vitiligo and celiac disease.65,66 Con-

firmation of these associations will require replication in

a larger independent and equally diverse population.

In conclusion, we have robustly established three

additional susceptibility loci for SLE: IRF8, TMEM39A,

and IKZF3-ZPBP2. Eleven other regions were replicated

but did not exceed the genome-wide threshold of signifi-

cance. Collectively, these data, along with other previously

reported loci, demonstrate the growing complexity of

the heritable contribution to SLE pathogenesis. A complete

understanding of how genetics influence the pathophysi-

ology of SLE will only be possible once we have identified

all contributing loci and functional and/or causal variants

for each association and have extensively evaluated the

role of rare variants. More work will be required if we are

to increase our understanding of how the loci identified

in this study influence SLE etiology.

Supplemental Data

Supplemental Data include supplemental acknowledgments, six

figures, and seven tables and can be found with this article online

at http://www.cell.com/AJHG.

Acknowledgments

We are grateful to all the individuals with SLE and those serving

as healthy controls who participated in this study. We thank the

following individuals for contributing samples: Marc Bijl,

Sandra D’Alfonso, Emoke Endreffy, Inigo Rua-Figueroa, Cintia Garcilazo,

Carmen Gutierrez, Peter Junker, Helle Laustrup, Rafaella Scorza,

Table 3. Replicated Loci that Demonstrate Suggestive Evidence of SLE Risk (a)

Columns p GWA scan through p META-Euro: European. Columns p Q and I^2: Test of Heterogeneity.

SNP         Locus             Alleles (b)  p GWA scan (c)  OR GWA scan (95% CI)  p REP         OR REP (95% CI)   p META-Euro   p Q (d)  I^2
rs7542235   CFHR1             A/G          3.94 × 10^-3    1.30 (1.11–1.54)      1.10 × 10^-3  1.15 (1.06–1.25)  1.85 × 10^-5  0.180    44.4%
rs485499    LOC730109/IL12A   A/G          2.14 × 10^-3    0.75 (0.65–0.87)      1.47 × 10^-4  0.87 (0.81–0.94)  1.31 × 10^-6  0.076    68.2%
rs669003    LOC730109/IL12A   A/G          2.16 × 10^-3    0.75 (0.65–0.87)      1.15 × 10^-4  0.87 (0.81–0.93)  1.02 × 10^-6  0.081    67.2%
rs7631930   LPP               A/G          3.60 × 10^-3    1.25 (1.06–1.49)      1.66 × 10^-3  1.15 (1.05–1.25)  2.71 × 10^-5  0.375    0.00%
rs9310002   CADM2             G/A          2.09 × 10^-3    2.06 (1.38–3.07)      6.12 × 10^-3  1.39 (1.10–1.76)  8.30 × 10^-5  0.099    63.3%
rs1075059   LOC63920          A/C          7.30 × 10^-4    0.82 (0.71–0.95)      7.26 × 10^-3  0.91 (0.85–0.97)  5.27 × 10^-5  0.221    33.2%
rs1895321   SLU7              A/C          4.21 × 10^-3    1.22 (1.06–1.41)      3.11 × 10^-3  1.11 (1.04–1.19)  6.09 × 10^-5  0.260    21.2%
rs7039790   ADAMTSL1          C/A          5.36 × 10^-3    1.62 (1.24–2.12)      1.14 × 10^-3  1.27 (1.10–1.47)  2.38 × 10^-5  0.124    57.7%
rs2940712   C10orf64          G/A          4.72 × 10^-3    0.79 (0.67–0.91)      8.73 × 10^-4  0.88 (0.82–0.95)  1.62 × 10^-5  0.178    45.0%
rs10790605  OR8D4             G/A          2.02 × 10^-3    0.80 (0.67–0.95)      4.35 × 10^-3  0.88 (0.81–0.96)  5.39 × 10^-5  0.321    0.00%
rs7960162   FAM19A2           A/G          4.95 × 10^-3    0.76 (0.61–0.94)      4.31 × 10^-3  0.87 (0.79–0.96)  9.79 × 10^-5  0.253    23.6%
rs749373    STXBP6            A/G          5.44 × 10^-3    1.34 (1.11–1.62)      2.32 × 10^-3  1.16 (1.05–1.27)  5.26 × 10^-5  0.171    46.7%

The following abbreviations are used: GWA, genome-wide association; OR, odds ratio; CI, confidence interval; and REP, replication.
(a) Table S7 contains results for all populations evaluated within this study.
(b) Major/minor alleles.
(c) GWA scan previously reported in Graham et al.1
(d) Cochran’s Q test statistic.


Berta Martins da Silva, Ana Suarez, and Carlos Vasconcelos. For the

GENLES collaboration, we thank Eduardo Acevedo, Mario Cardiel,

Ignacio García de la Torre, Mabel Busajm, Cecilia Castel, Marco

Maradiaga, Jose F. Moctezuma, and Jorge Musuruana. For the

Asociacion Andaluza de Enfermedades Autoimmunes collabora-

tion, we thank Juan Jimenez-Alonso, Norberto Ortego-Centeno,

Enrique de Ramon, and Julio Sanchez-Roman. We would like to

thank Summer Frank and Mei Li Zhu for their assistance in geno-

typing, quality-control analyses, and clinical data management.

We would also like to thank Emily Cole for her assistance in

preparing figures. Grant support information is provided in the

Supplemental Acknowledgments available online.

Received: October 26, 2011

Revised: February 22, 2012

Accepted: February 22, 2012

Published online: March 29, 2012

Web Resources

The URLs for data presented herein are as follows:

dbSNP, http://www.ncbi.nlm.nih.gov/projects/SNP/index.html

Encyclopedia of DNA Elements (ENCODE), http://genome.ucsc.edu/ENCODE/

HUGO Gene Nomenclature Committee, http://www.genenames.org/

LocusZoom, http://csg.sph.umich.edu/locuszoom/

Online Mendelian Inheritance in Man (OMIM), http://www.omim.org

UCSC Genome Browser, http://genome.ucsc.edu/

VCFtools, http://www.vcftools.sourceforge.net

References

1. Petri, M. (2002). Epidemiology of systemic lupus erythemato-

sus. Best Pract. Res. Clin. Rheumatol. 16, 847–858.

2. Moser, K.L., Kelly, J.A., Lessard, C.J., and Harley, J.B. (2009).

Recent insights into the genetic basis of systemic lupus eryth-

ematosus. Genes Immun. 10, 373–379.

3. Alarcon-Segovia, D., Alarcon-Riquelme, M.E., Cardiel, M.H.,

Caeiro, F., Massardo, L., Villa, A.R., and Pons-Estel, B.A.; Grupo

Latinoamericano de Estudio del Lupus Eritematoso (GLADEL).

(2005). Familial aggregation of systemic lupus erythematosus,

rheumatoid arthritis, and other autoimmune diseases in 1,177

lupus patients from the GLADEL cohort. Arthritis Rheum. 52,

1138–1147.

4. Anaya, J.M., Tobon, G.J., Vega, P., and Castiblanco, J. (2006).

Autoimmune disease aggregation in families with primary

Sjogren’s syndrome. J. Rheumatol. 33, 2227–2234.

5. Arora-Singh, R.K., Assassi, S., del Junco, D.J., Arnett, F.C., Perry,

M., Irfan, U., Sharif, R., Mattar, T., and Mayes, M.D. (2010).

Autoimmune diseases and autoantibodies in the first degree

relatives of patients with systemic sclerosis. J. Autoimmun.

35, 52–57.

6. Sestak, A.L., Shaver, T.S., Moser, K.L., Neas, B.R., and Harley,

J.B. (1999). Familial aggregation of lupus and autoimmunity

in an unusual multiplex pedigree. J. Rheumatol. 26, 1495–

1499.

7. Deng, Y., and Tsao, B.P. (2010). Genetic susceptibility to

systemic lupus erythematosus in the genomic era. Nat. Rev.

Rheumatol. 6, 683–692.

8. Reinertsen, J.L., Klippel, J.H., Johnson, A.H., Steinberg, A.D.,

Decker, J.L., and Mann, D.L. (1978). B-lymphocyte alloanti-

gens associated with systemic lupus erythematosus. N. Engl.

J. Med. 299, 515–518.

9. Nies, K.M., Brown, J.C., Dubois, E.L., Quismorio, F.P., Friou,

G.J., and Terasaki, P.I. (1974). Histocompatibility (HL-A) anti-

gens and lymphocytotoxic antibodies in systemic lupus

erythematosus (SLE). Arthritis Rheum. 17, 397–402.

10. Graham, R.R., Ortmann, W.A., Langefeld, C.D., Jawaheer, D.,

Selby, S.A., Rodine, P.R., Baechler, E.C., Rohlf, K.E., Shark,

K.B., Espe, K.J., et al. (2002). Visualizing human leukocyte

antigen class II risk haplotypes in human systemic lupus

erythematosus. Am. J. Hum. Genet. 71, 543–553.

11. McCulloch, D.K., Klaff, L.J., Kahn, S.E., Schoenfeld, S.L.,

Greenbaum, C.J., Mauseth, R.S., Benson, E.A., Nepom, G.T.,

Shewey, L., and Palmer, J.P. (1990). Nonprogression of subclin-

ical beta-cell dysfunction among first-degree relatives of

IDDM patients. 5-yr follow-up of the Seattle Family Study.

Diabetes 39, 549–556.

12. Baechler, E.C., Batliwalla, F.M., Karypis, G., Gaffney, P.M., Ort-

mann, W.A., Espe, K.J., Shark, K.B., Grande, W.J., Hughes,

K.M., Kapur, V., et al. (2003). Interferon-inducible gene

expression signature in peripheral blood cells of patients

with severe lupus. Proc. Natl. Acad. Sci. USA 100, 2610–2615.

13. Bennett, L., Palucka, A.K., Arce, E., Cantrell, V., Borvak, J., Ban-

chereau, J., and Pascual, V. (2003). Interferon and granulopoi-

esis signatures in systemic lupus erythematosus blood. J. Exp.

Med. 197, 711–723.

14. Kirou, K.A., Lee, C., George, S., Louca, K., Papagiannis, I.G.,

Peterson, M.G., Ly, N., Woodward, R.N., Fry, K.E., Lau, A.Y.,

et al. (2004). Coordinate overexpression of interferon-alpha-

induced genes in systemic lupus erythematosus. Arthritis

Rheum. 50, 3958–3967.

15. Sigurdsson, S., Nordmark, G., Goring, H.H., Lindroos, K.,

Wiman, A.C., Sturfelt, G., Jonsen, A., Rantapaa-Dahlqvist, S.,

Moller, B., Kere, J., et al. (2005). Polymorphisms in the tyro-

sine kinase 2 and interferon regulatory factor 5 genes are

associated with systemic lupus erythematosus. Am. J. Hum.

Genet. 76, 528–537.

16. Hom, G., Graham, R.R., Modrek, B., Taylor, K.E., Ortmann,

W., Garnier, S., Lee, A.T., Chung, S.A., Ferreira, R.C., Pant,

P.V., et al. (2008). Association of systemic lupus erythematosus

with C8orf13-BLK and ITGAM-ITGAX. N. Engl. J. Med. 358,

900–909.

17. Harley, J.B., Alarcon-Riquelme, M.E., Criswell, L.A., Jacob,

C.O., Kimberly, R.P., Moser, K.L., Tsao, B.P., Vyse, T.J., Lange-

feld, C.D., Nath, S.K., et al; International Consortium for

Systemic Lupus Erythematosus Genetics (SLEGEN). (2008).

Genome-wide association scan in women with systemic lupus

erythematosus identifies susceptibility variants in ITGAM,

PXK, KIAA1542 and other loci. Nat. Genet. 40, 204–210.

18. Yang, W., Shen, N., Ye, D.Q., Liu, Q., Zhang, Y., Qian, X.X.,

Hirankarn, N., Ying, D., Pan, H.F., Mok, C.C., et al; Asian

Lupus Genetics Consortium. (2010). Genome-wide associa-

tion study in Asian populations identifies variants in ETS1

and WDFY4 associated with systemic lupus erythematosus.

PLoS Genet. 6, e1000841.

19. Graham, R.R., Cotsapas, C., Davies, L., Hackett, R., Lessard,

C.J., Leon, J.M., Burtt, N.P., Guiducci, C., Parkin, M., Gates,

C., et al. (2008). Genetic variants near TNFAIP3 on 6q23 are

associated with systemic lupus erythematosus. Nat. Genet.

40, 1059–1061.


20. Kozyrev, S.V., Abelson, A.K., Wojcik, J., Zaghlool, A., Linga

Reddy, M.V., Sanchez, E., Gunnarsson, I., Svenungsson, E.,

Sturfelt, G., Jonsen, A., et al. (2008). Functional variants in

the B-cell gene BANK1 are associated with systemic lupus

erythematosus. Nat. Genet. 40, 211–216.

21. Han, J.W., Zheng, H.F., Cui, Y., Sun, L.D., Ye, D.Q., Hu, Z., Xu,

J.H., Cai, Z.M., Huang, W., Zhao, G.P., et al. (2009). Genome-

wide association study in a Chinese Han population identifies

nine new susceptibility loci for systemic lupus erythematosus.

Nat. Genet. 41, 1234–1237.

22. Gateva, V., Sandling, J.K., Hom, G., Taylor, K.E., Chung, S.A.,

Sun, X., Ortmann, W., Kosoy, R., Ferreira, R.C., Nordmark,

G., et al. (2009). A large-scale replication study identifies

TNIP1, PRDM1, JAZF1, UHRF1BP1 and IL10 as risk loci for

systemic lupus erythematosus. Nat. Genet. 41, 1228–1233.

23. Lessard, C.J., Adrianto, I., Kelly, J.A., Kaufman, K.M., Grun-

dahl, K.M., Adler, A., Williams, A.H., Gallant, C.J., Anaya,

J.M., Bae, S.C., et al; Marta E. Alarcon-Riquelme on behalf of

the BIOLUPUS and GENLES Networks. (2011). Identification

of a systemic lupus erythematosus susceptibility locus at

11p13 between PDHX and CD44 in a multiethnic study.

Am. J. Hum. Genet. 88, 83–91.

24. Namjou, B., Kothari, P.H., Kelly, J.A., Glenn, S.B., Ojwang,

J.O., Adler, A., Alarcon-Riquelme, M.E., Gallant, C.J., Boackle,

S.A., Criswell, L.A., et al. (2011). Evaluation of the TREX1 gene

in a large multi-ancestral lupus cohort. Genes Immun. 12,

270–279.

25. Adrianto, I., Wen, F., Templeton, A., Wiley, G., King, J.B.,

Lessard, C.J., Bates, J.S., Hu, Y., Kelly, J.A., Kaufman, K.M.,

et al; BIOLUPUS and GENLES Networks. (2011). Association

of a functional variant downstream of TNFAIP3 with systemic

lupus erythematosus. Nat. Genet. 43, 253–258.

26. Tan, W., Sunahori, K., Zhao, J., Deng, Y., Kaufman, K.M., Kelly,

J.A., Langefeld, C.D., Williams, A.H., Comeau, M.E., Ziegler,

J.T., et al; BIOLUPUS Network; GENLES Network. (2011).

Association of PPP2CA polymorphisms with systemic lupus

erythematosus susceptibility in multiple ethnic groups.

Arthritis Rheum. 63, 2755–2763.

27. Zhao, J., Wu, H., Khosravi, M., Cui, H., Qian, X., Kelly, J.A.,

Kaufman, K.M., Langefeld, C.D., Williams, A.H., Comeau,

M.E., et al; BIOLUPUS Network; GENLES Network. (2011).

Association of genetic variants in complement factor H and

factor H-related genes with systemic lupus erythematosus

susceptibility. PLoS Genet. 7, e1002079.

28. Sanchez, E., Nadig, A., Richardson, B.C., Freedman, B.I., Kauf-

man, K.M., Kelly, J.A., Niewold, T.B., Kamen, D.L., Gilkeson,

G.S., Ziegler, J.T., et al; BIOLUPUS and GENLES. (2011).

Phenotypic associations of genetic susceptibility loci in systemic

lupus erythematosus. Ann. Rheum. Dis. 70, 1752–1757.

29. Hochberg, M.C. (1997). Updating the American College of

Rheumatology revised criteria for the classification of systemic

lupus erythematosus. Arthritis Rheum. 40, 1725.

30. Barrett, J.C., Fry, B., Maller, J., and Daly, M.J. (2005). Haplo-

view: Analysis and visualization of LD and haplotype maps.

Bioinformatics 21, 263–265.

31. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E.,

Shadick, N.A., and Reich, D. (2006). Principal components

analysis corrects for stratification in genome-wide association

studies. Nat. Genet. 38, 904–909.

32. McKeigue, P.M., Carpenter, J.R., Parra, E.J., and Shriver, M.D.

(2000). Estimation of admixture and detection of linkage in

admixed populations by a Bayesian approach: Application

to African-American populations. Ann. Hum. Genet. 64,

171–186.

33. Hoggart, C.J., Parra, E.J., Shriver, M.D., Bonilla, C., Kittles,

R.A., Clayton, D.G., and McKeigue, P.M. (2003). Control of

confounding of genetic associations in stratified populations.

Am. J. Hum. Genet. 72, 1492–1504.

34. Hoggart, C.J., Shriver, M.D., Kittles, R.A., Clayton, D.G., and

McKeigue, P.M. (2004). Design and analysis of admixture

mapping studies. Am. J. Hum. Genet. 74, 965–978.

35. Smith, M.W., Patterson, N., Lautenberger, J.A., Truelove, A.L.,

McDonald, G.J., Waliszewska, A., Kessing, B.D., Malasky, M.J.,

Scafe, C., Le, E., et al. (2004). A high-density admixture map

for disease gene discovery in african americans. Am. J. Hum.

Genet. 74, 1001–1013.

36. Halder, I., Shriver, M., Thomas, M., Fernandez, J.R., and

Frudakis, T. (2008). A panel of ancestry informative markers

for estimating individual biogeographical ancestry and

admixture from four continents: Utility and applications.

Hum. Mutat. 29, 648–658.

37. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira,

M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly,

M.J., and Sham, P.C. (2007). PLINK: A tool set for whole-

genome association and population-based linkage analyses.

Am. J. Hum. Genet. 81, 559–575.

38. Willer, C.J., Li, Y., and Abecasis, G.R. (2010). METAL: Fast

and efficient meta-analysis of genomewide association scans.

Bioinformatics 26, 2190–2191.

39. Cochran, W.G. (1954). The Combination of Estimates from

Different Experiments. Biometrics 10, 101–129.

40. Higgins, J.P., Thompson, S.G., Deeks, J.J., and Altman, D.G.

(2003). Measuring inconsistency in meta-analyses. BMJ 327,

557–560.

41. Li, H., and Durbin, R. (2009). Fast and accurate short read

alignment with Burrows-Wheeler transform. Bioinformatics

25, 1754–1760.

42. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis,

K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly,

M., and DePristo, M.A. (2010). The Genome Analysis Toolkit:

A MapReduce framework for analyzing next-generation DNA

sequencing data. Genome Res. 20, 1297–1303.

43. Browning, S.R., and Browning, B.L. (2007). Rapid and accurate

haplotype phasing and missing-data inference for whole-

genome association studies by use of localized haplotype clus-

tering. Am. J. Hum. Genet. 81, 1084–1097.

44. Robinson, J.T., Thorvaldsdottir, H., Winckler, W., Guttman,

M., Lander, E.S., Getz, G., and Mesirov, J.P. (2011). Integrative

genomics viewer. Nat. Biotechnol. 29, 24–26.

45. Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve,

L.L., Gibbs, R.A., Belmont, J.W., Boudreau, A., Hardenbol, P.,

Leal, S.M., et al; International HapMap Consortium. (2007).

A second generation human haplotype map of over 3.1

million SNPs. Nature 449, 851–861.

46. Howie, B.N., Donnelly, P., and Marchini, J. (2009). A flexible

and accurate genotype imputation method for the next generation

of genome-wide association studies. PLoS Genet. 5,

e1000529.

47. Via, M., Gignoux, C., and Burchard, E.G. (2010). The 1000

Genomes Project: New opportunities for research and social

challenges. Genome Med. 2, 3.

48. 1000 Genomes Project Consortium. (2010). A map of human

genome variation from population-scale sequencing. Nature

467, 1061–1073.


49. Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R.,

Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermit-

zakis, E.T., Thurman, R.E., et al; ENCODE Project Consortium;

NISC Comparative Sequencing Program; Baylor College of

Medicine Human Genome Sequencing Center; Washington

University Genome Sequencing Center; Broad Institute;

Children’s Hospital Oakland Research Institute. (2007).

Identification and analysis of functional elements in 1% of

the human genome by the ENCODE pilot project. Nature

447, 799–816.

50. Tamura, T., Yanai, H., Savitsky, D., and Taniguchi, T. (2008).

The IRF family transcription factors in immunity and onco-

genesis. Annu. Rev. Immunol. 26, 535–584.

51. Baechler, E.C., Gregersen, P.K., and Behrens, T.W. (2004).

The emerging role of interferon in human systemic lupus

erythematosus. Curr. Opin. Immunol. 16, 801–807.

52. Baechler, E.C., Batliwalla, F.M., Reed, A.M., Peterson, E.J.,

Gaffney, P.M., Moser, K.L., Gregersen, P.K., and Behrens,

T.W. (2006). Gene expression profiling in human autoimmu-

nity. Immunol. Rev. 210, 120–137.

53. De Jager, P.L., Jia, X., Wang, J., de Bakker, P.I., Ottoboni, L., Ag-

garwal, N.T., Piccio, L., Raychaudhuri, S., Tran, D., Aubin, C.,

et al; International MS Genetics Consortium. (2009). Meta-

analysis of genome scans and replication identify CD6, IRF8

and TNFRSF1A as new multiple sclerosis susceptibility loci.

Nat. Genet. 41, 776–782.

54. International Multiple Sclerosis Genetics Consortium

(IMSGC). (2010). Comprehensive follow-up of the first

genome-wide association study of multiple sclerosis identifies

KIF21B and TMEM39A as susceptibility loci. Hum. Mol.

Genet. 19, 953–962.

55. Franke, A., McGovern, D.P., Barrett, J.C., Wang, K., Radford-

Smith, G.L., Ahmad, T., Lees, C.W., Balschun, T., Lee, J., Rob-

erts, R., et al. (2010). Genome-wide meta-analysis increases to

71 the number of confirmed Crohn’s disease susceptibility

loci. Nat. Genet. 42, 1118–1125.

56. Liu, X., Invernizzi, P., Lu, Y., Kosoy, R., Lu, Y., Bianchi, I.,

Podda, M., Xu, C., Xie, G., Macciardi, F., et al. (2010).

Genome-wide meta-analyses identify three loci associated

with primary biliary cirrhosis. Nat. Genet. 42, 658–660.

57. Hirschfield, G.M., Liu, X., Han, Y., Gorlov, I.P., Lu, Y., Xu, C.,

Lu, Y., Chen, W., Juran, B.D., Coltescu, C., et al. (2010).

Variants at IRF5-TNPO3, 17q12-21 and MMEL1 are associated

with primary biliary cirrhosis. Nat. Genet. 42, 655–657.

58. Stahl, E.A., Raychaudhuri, S., Remmers, E.F., Xie, G., Eyre, S.,

Thomson, B.P., Li, Y., Kurreeman, F.A., Zhernakova, A., Hinks,

A., et al; BIRAC Consortium; YEAR Consortium. (2010).

Genome-wide association study meta-analysis identifies seven

new rheumatoid arthritis risk loci. Nat. Genet. 42, 508–514.

59. Anderson, C.A., Boucher, G., Lees, C.W., Franke, A., D’Amato,

M., Taylor, K.D., Lee, J.C., Goyette, P., Imielinski, M., Latiano,

A., et al. (2011). Meta-analysis identifies 29 additional ulcera-

tive colitis risk loci, increasing the number of confirmed

associations to 47. Nat. Genet. 43, 246–252.

60. Hirschfield, G.M., Liu, X., Xu, C., Lu, Y., Xie, G., Lu, Y., Gu, X.,

Walker, E.J., Jing, K., Juran, B.D., et al. (2009). Primary biliary

cirrhosis associated with HLA, IL12A, and IL12RB2 variants.

N. Engl. J. Med. 360, 2544–2555.

61. Sun, J., Matthias, G., Mihatsch, M.J., Georgopoulos, K., and

Matthias, P. (2003). Lack of the transcriptional coactivator

OBF-1 prevents the development of systemic lupus erythema-

tosus-like phenotypes in Aiolos mutant mice. J. Immunol.

170, 1699–1706.

62. International Multiple Sclerosis Genetics Consortium

(IMSGC). (2010). IL12A, MPHOSPH9/CDK2AP1 and RGS1

are novel multiple sclerosis susceptibility loci. Genes Immun.

11, 397–405.

63. Peluso, I., Pallone, F., and Monteleone, G. (2006). Inter-

leukin-12 and Th1 immune response in Crohn’s disease:

Pathogenetic relevance and therapeutic implication. World

J. Gastroenterol. 12, 5606–5610.

64. Remmers, E.F., Plenge, R.M., Lee, A.T., Graham, R.R., Hom, G.,

Behrens, T.W., de Bakker, P.I., Le, J.M., Lee, H.S., Batliwalla, F.,

et al. (2007). STAT4 and the risk of rheumatoid arthritis and

systemic lupus erythematosus. N. Engl. J. Med. 357, 977–986.

65. Jin, Y., Birlea, S.A., Fain, P.R., Gowan, K., Riccardi, S.L.,

Holland, P.J., Mailloux, C.M., Sufit, A.J., Hutton, S.M.,

Amadi-Myers, A., et al. (2010). Variant of TYR and autoimmu-

nity susceptibility loci in generalized vitiligo. N. Engl. J. Med.

362, 1686–1697.

66. Hunt, K.A., Zhernakova, A., Turner, G., Heap, G.A., Franke, L.,

Bruinenberg, M., Romanos, J., Dinesen, L.C., Ryan, A.W.,

Panesar, D., et al. (2008). Newly identified genetic risk variants

for celiac disease related to the immune response. Nat. Genet.

40, 395–402.


ARTICLE

Attenuated BMP1 Function Compromises Osteogenesis, Leading to Bone Fragility in Humans and Zebrafish

P.V. Asharani,1,10 Katharina Keupp,2,3,4,10 Oliver Semler,5 Wenshen Wang,1 Yun Li,2,3,4 Holger Thiele,6

Gokhan Yigit,2,3,4 Esther Pohl,2,3,4 Jutta Becker,3 Peter Frommolt,4,6 Carmen Sonntag,7,12

Janine Altmuller,6 Katharina Zimmermann,3 Daniel S. Greenspan,8 Nurten A. Akarsu,9

Christian Netzer,3 Eckhard Schonau,5 Radu Wirth,3 Matthias Hammerschmidt,2,4,7

Peter Nurnberg,2,4,6 Bernd Wollnik,2,3,4,11,* and Thomas J. Carney1,11,*

Bone morphogenetic protein 1 (BMP1) is an astacin metalloprotease with important cellular functions and diverse substrates, including

extracellular-matrix proteins and antagonists of some TGFβ superfamily members. Combining whole-exome sequencing and filtering

for homozygous stretches of identified variants, we found a homozygous causative BMP1 mutation, c.34G>C, in a consanguineous

family affected by increased bone mineral density and multiple recurrent fractures. The mutation is located within the BMP1 signal

peptide and leads to impaired secretion and an alteration in posttranslational modification. We also characterize a zebrafish bone mutant

harboring lesions in bmp1a, demonstrating conservation of BMP1 function in osteogenesis across species. Genetic, biochemical, and

histological analyses of this mutant and a comparison to a second, similar locus reveal that Bmp1a is critically required for

mature-collagen generation, downstream of osteoblast maturation, in bone. We thus define the molecular and cellular bases of BMP1-dependent

osteogenesis and show the importance of this protein for bone formation and stability.

Introduction

Osteogenesis imperfecta (OI), also known as “brittle-bone

disease,” is a rare genetic collagenopathy primarily characterized

by dramatically increased bone fragility causing

susceptibility to numerous fractures.1,2 Individuals often

show distinctive features including reduced bone mass,

short stature, blue sclerae, and/or dentinogenesis imper-

fecta. The severity of this disorder varies from profound

forms with intrauterine fractures and perinatal lethality to

milder phenotypic expression such as rare fractures or

even no fractures.3,4 Most OI cases are inherited in an auto-

somal-dominant manner and are caused by mutations in

COL1A1 (MIM 120150) and COL1A2 (MIM 120160);5,6

these two genes encode the two α chains of collagen

type I, the predominant protein component of the bone

matrix. For a minority of OI individuals, autosomal-recessive

inheritance has been described, with underlying

mutations in CRTAP (MIM 605497),7 LEPRE1 (MIM

610339),8 SERPINH1 (MIM 600943),9 PPIB (MIM

123841),10 SP7 (MIM 606633),11 SERPINF1 (MIM 172860),12

and FKBP10 (MIM 607063).13 Mutations in COL1A1 or

COL1A2 result in primary structural or quantitative defects

of collagen I molecules, whereas genetic mutations causa-

tive for recessive forms mainly lead to defects in collagen I

biosynthesis. Defects in collagen I have been associated

with reduced bone mineralization and bone fragility not

only in individuals suffering from congenital OI but also

in elderly individuals who have developed osteoporosis.14

Type I collagen belongs to the fibril-forming collagens

and is composed of a triple helix consisting of two α1(I)

chains and one α2(I) chain. The α chains are synthesized

at the rough endoplasmic reticulum (ER). They are highly

post-translationally modified in the ER lumen and are

subsequently assembled to a triple helix. This premature

collagen helix, which contains globular appendages at

the amino (N-) and carboxyl (C-) ends, is then transported

through the trans-Golgi network to the extracellular matrix

(ECM), where N- and C-proteinases catalyze proteolytic

cleavage of these propeptides. After this final processing

step, the released mature collagen I triple-helical monomer

can be assembled into highly ordered collagen fibrils.15,16

Bone morphogenetic protein 1 (BMP1) is an astacin

metalloprotease,17,18 the physiological function of which has

been of considerable interest for some time. This protein

has been suggested to play essential roles in osteogenesis

and ECM formation; it has also been described as exerting

influence over dorsal-ventral patterning through the indirect

activation of some TGFβ-like proteins.19–21 Of particular

interest has been the role of BMP1 in proteolytic

removal of the C-propeptides from procollagen precursors

of the major fibrillar collagen types I–III. This processing is

1Institute of Molecular and Cell Biology, Proteos, Singapore 138673, Singapore; 2Center for Molecular Medicine Cologne, University of Cologne, Cologne

D-50931, Germany; 3Institute of Human Genetics, University Hospital Cologne, University of Cologne, Cologne D-50931, Germany; 4Cologne Excellence

Cluster on Cellular Stress Responses in Aging-Associated Diseases, University of Cologne, Cologne D-50674, Germany; 5Children’s Hospital, University of

Cologne, Cologne D-50937, Germany; 6Cologne Center for Genomics, University of Cologne, Cologne D-50931, Germany; 7Institute of Developmental

Biology, University of Cologne, Cologne D-50674, Germany; 8Department of Cell and Regenerative Biology, School of Medicine and Public Health,

University of Wisconsin, Madison, Wisconsin 53706, USA; 9Department of Medical Genetics, Hacettepe University Medical Faculty, 06100 Ankara, Turkey

10These authors contributed equally to this work

11These authors contributed equally to this work

12Present address: Australian Regenerative Medicine Institute, Monash University, Victoria 3800, Australia

*Correspondence: [email protected] (B.W.), [email protected] (T.J.C.)

DOI 10.1016/j.ajhg.2012.02.026. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 661–674, April 6, 2012 661

essential for the self-assembly of mature collagen mono-

mers into fibrils.22 The precise functional requirement of

BMP1 in vivo is unclear. To date, Bmp1 loss of function

has only been analyzed in a knock-out mouse, in which

it was found to be lethal around birth and for which no

detailed analysis of osteogenesis was presented.23 Thus,

the role of BMP1 in bone formation and organogenesis

remains obscure.

Here, we describe two siblings with a high-bone-density

form of OI, identify a causative mutation in BMP1, and

provide detailed analysis of the effect of the mutation on

BMP1 modification and secretion. We show that the hypofunctional

nature of the mutation is demonstrated in vivo

by using two assays that test substrate cleavage in zebra-

fish. We further describe a zebrafish Bmp1a mutant with

skeletal defects comparable to those seen in individuals

with OI, demonstrating conservation of important BMP1

function in osteogenesis across species. Our analysis of

these mutants has demonstrated that loss of Bmp1a affects

neither osteoblast formation nor activity but rather the

ability to generate mature collagen fibrils. This finding

therefore indicates that BMP1 is very much required in

the process of bone formation.

Material and Methods

Whole-Exome Sequencing

Genomic DNA was enriched from exonic and adjacent splice-site

sequences by the use of the Agilent SureSelect Human Exome

Kit and run on the Illumina Genome Analyzer IIX Sequencer.

Further data analysis was performed with an in-house bioinformatics

pipeline in combination with SAMtools v.0.1.7 for SNP

and indel detection. In-house-developed scripts were applied for

the detection of protein changes, splice-site alterations, and overlaps

with known variations (Ensembl build 61 and 1000 Genomes

Project release 2010_3).
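As a rough illustration of the overlap check against known variation described above, the following minimal Python sketch drops any call that already appears in a set of known variants. It is not the in-house pipeline, which is not described in detail here; the `Variant` record, the `novel_variants` helper, and the toy coordinates are all hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not the pipeline's real format)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    homozygous: bool


def novel_variants(calls, known):
    """Keep calls whose (chrom, pos, ref, alt) key is absent from the known set."""
    return [v for v in calls if (v.chrom, v.pos, v.ref, v.alt) not in known]


# Toy input for illustration only; these are not variants from the study.
calls = [
    Variant("chr1", 1000, "A", "G", True),
    Variant("chr2", 2000, "G", "C", True),
    Variant("chr3", 3000, "T", "C", False),
]
known = {("chr1", 1000, "A", "G")}  # stands in for a database of known variation

candidates = novel_variants(calls, known)
```

In practice the known set would be built from the database releases named above, and the key would need to be normalized (for example, for indel representation) before comparison.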

Mutation Screening

The identified mutation was resequenced in an independent

experiment, tested for cosegregation with the phenotype within

the family, and then screened in 300 healthy control individuals

from Turkey by PCR and restriction digestion (AvaI; Fermentas,

St. Leon-Rot, Germany). All subjects or their legal representatives

gave written informed consent for the study. The study was per-

formed in accordance with the Declaration of Helsinki protocols

and was approved by the local institutional review boards.

Generation of BMP1 Constructs

The BMP1-FLAG pcDNA3.1 construct contained full-length human

cDNA of BMP1 (RefSeq accession number NM_001199.3) and was

fused to a C-terminal FLAG tag; it was provided by the group of

Karl E. Kadler (University of Manchester, UK) and was used for

BMP1 expression studies. The identified substitution, p.Gly12Arg,

was introduced by site-directed PCR mutagenesis with the use of

a primer containing the specific nucleotide substitution.
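The correspondence between the nucleotide change c.34G>C and the protein change p.Gly12Arg can be checked with simple codon arithmetic. The sketch below assumes only the standard genetic code and standard coding-position numbering (position 1 is the first base of the start codon); the helper names are illustrative, and the actual sequence of codon 12 in NM_001199.3 is not assumed.

```python
def codon_index(cdna_pos):
    """Return (1-based codon number, 0-based offset in the codon) for a coding position."""
    return (cdna_pos - 1) // 3 + 1, (cdna_pos - 1) % 3


def substitute(codon, offset, alt):
    """Return the codon with the base at `offset` replaced by `alt`."""
    return codon[:offset] + alt + codon[offset + 1:]


# Standard genetic code: all glycine and arginine codons (DNA alphabet).
GLY_CODONS = {"GGT", "GGC", "GGA", "GGG"}
ARG_CODONS = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}

# c.34 falls on the first base of codon 12.
codon_number, offset = codon_index(34)

# A G>C change at the first base of any glycine codon yields an arginine codon.
mutated = {substitute(codon, offset, "C") for codon in GLY_CODONS}
```

Because every glycine codon begins with GG, the result does not depend on which glycine codon is present at position 12.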

Cell Culture and Transient Transfection

Human embryonic kidney (HEK) 293T cells were cultured in

Dulbecco’s modified Eagle’s medium (DMEM) containing 10%

fetal bovine serum (FBS, GIBCO) and antibiotics. Cells were

transiently transfected with Lipofectamine 2000 (Invitrogen,

Karlsruhe, Germany) and vectors containing wild-type (WT) and

mutant variants of BMP1 cDNA. Transfections were performed

according to the manufacturer’s instructions.

Secretion Assay and Immunoblot Analysis

30 hr after transient transfection, cells and untransfected control

cells were maintained in serum-free medium for 18 hr at 37°C.

After starvation, supernatant-containing secreted proteins were

precipitated by trichloroacetic acid, and cells were lysed with ice-

cold lysis buffer. The total protein concentration of the extracts

was determined by the BCA (bicinchoninic acid) Protein Assay

Kit (Pierce Protein Research Products, Thermo Fisher Scientific,

Rockford, IL, USA), and proteins were separated by gradient

(4%–12%) SDS-PAGE (Invitrogen) under reducing conditions

and transferred to a nitrocellulose membrane by immunoblotting.

Immunoblots were blocked in 5% milk powder in TBS containing

0.1% Tween20, and they were probed with Flag antibody (Agilent

Technologies, Waldbronn, Germany). Equal protein amounts

were confirmed by β-actin detection in whole-cell lysates or by

Coomassie staining of a ~65 kDa protein in supernatant of

serum-free medium. A peroxidase-conjugated secondary antibody

(goat anti-mouse) was purchased from Santa Cruz Biotechnology

(Santa Cruz, CA, USA), and blots were developed with an

enhanced chemiluminescence system, ECL Plus (Amersham,

UK); exposure on autoradiographic film (GE Healthcare,

München, Germany) followed.

Zebrafish Studies

Radioimmunoprecipitation (RIPA) buffer (50 mM Tris-HCl, pH

7.6, 150 mM NaCl, 0.1% SDS, 0.5% sodium deoxycholate, and

1% NP-40) was used for the extraction of total protein from zebra-

fish larvae (6 days old) or fins (4 months old), and the protein was

measured with a standard Bradford assay. Proteins (20 µg) were

separated on a 6% denaturing polyacrylamide gel, transferred to

a nitrocellulose membrane, and probed with a rabbit polyclonal

antibody raised against a peptide of zebrafish Collagen1a1a.

Goat anti-rabbit antibody conjugated with horseradish peroxidase

(HRP) was employed as a secondary antibody, and the bands were

visualized with chemiluminescence detection (Millipore, Billerica,

MA, USA). For the loading control, blots were stripped and then

reprobed with a rabbit anti-β-actin antibody (Cell Signaling

Technology, Danvers, MA, USA).

N-glycosidase Assay

For the N-glycosylation studies, 20 µg of whole-cell lysates of

transiently transfected and untransfected HEK 293T cells were

either treated with N-glycosidase F (PNGaseF) (New England

BioLabs, Frankfurt, Germany) or left untreated. The enzyme reac-

tion was performed according to the manufacturer’s instructions.

Proteins were subjected to SDS-PAGE and analyzed by anti-Flag

immunoblotting. Equal protein amounts were confirmed by

β-actin detection.

Fish Lines and Mapping

Embryos were obtained by natural crosses and were staged as per

Parichy et al.24 The microwaved (med tt281), dino (chd tt250), and frilly

fins (frf tm317a, frf tf5, frf tp34, and frf ty68) alleles that we used have

been described previously25 and were isolated in an N-ethyl-N-

nitrosourea (ENU) screen in Freiburg (frf fr24). frf tm317 was used


for all analyses. The Tg(sp7:mCherry) transgenic line has been re-

ported previously.26 We performed genetic mapping by crossing

frf+/− and med+/− fish to the WIK strain and by subjecting the

F2 progeny to simple-sequence-length polymorphism (SSLP)

meiotic mapping as outlined by Geisler.27 Heat shocks were performed

at 37°C for 1 hr at 30 and 56 hr postfertilization (hpf).

Microscopy

Fluorescent images were taken on an Olympus Fluoview confocal

microscope, whereas brightfield and Nomarski micrographs were

taken on a Zeiss Axioimager or a Leica MZ16FA. For live imaging,

larvae were anesthetized in Tricaine and mounted in 3% methyl

cellulose or 1% low-melting-point agarose. All whole-mount stain-

ings using alizarin red, in situ hybridization, or immunodetection

were cleared in glycerol prior to mounting. Stained ultrathin

sections of the fins were employed for transmission electron

microscopy (TEM) on a Jeol JEM-1010 electron microscope.

RNA and DNA Isolation, cDNA Synthesis, and Sequencing

Trizol (Invitrogen, CA, USA) was used for the isolation of RNA

from WT or mutant larvae, and SuperscriptIII Reverse Transcrip-

tase (Invitrogen, CA, USA) was used for the generation of cDNA

by reverse transcription. RT-PCR was used for the amplification

of bmp1a cDNA from all frilly fins mutated alleles and correspond-

ing siblings. Similarly, col1a1a cDNA was amplified from the

microwaved mutant allele and siblings. Resulting PCR fragments

were purified and sequenced directly. To identify mutations at

the genomic level, we extracted larval genomic DNA and directly

sequenced the region of interest from amplified PCR products.

DNA Construct Generation

Gateway cloning technology and entry constructs from the

Tol2Kit28 were used for the generation of zebrafish and Myc-

tagged human Bmp1 heat-shock constructs. Bmp1 coding regions

were cloned into a middle entry vector via standard cloning

methods or through a BP Gateway reaction.

DNA and RNA Injections

DNA and RNAs were diluted in Danieau buffer and Phenol

red before being injected into 1-cell embryos by a Pico-Injector

(Harvard Apparatus, MA, USA). Sense RNAs for overexpression

or rescue were transcribed from cDNAs cloned into pCS2+.

Plasmids were linearized with NotI, and capped mRNA was

synthesized with mMessage mMachine SP6 kit (Ambion, Applied

Biosystems, Austin, TX, USA). Full-length cDNA for BMP1 was

amplified from a HeLa-cell cDNA library and cloned into pCS2+.

A primer harboring the p.Gly12Arg signal-peptide substitution

was used for the generation of the mutant BMP1 cDNA version

by PCR. chordin RNA was transcribed as previously reported.29

chordin and BMP1 RNA were injected at a concentration of 60

and 450 pg, respectively. To generate adult dino mutants, we

rescued early patterning defects by injecting 30 pg of chordin

RNA. Tol2 RNA was generated as published.30

In Situ Hybridization

Single- and double-stranded-RNA in situ hybridizations were

performed and developed with either chromogenic substrate31

or fluorescent tyramide signal amplification.32 Probes for sp7,

osteopontin, and collagen10a1 were synthesized as described.33

The bmp1a probe was generated with EcoRI by linearizing clone

IMAGp998H0417161 from ImaGenes (Berlin, Germany) and was

transcribed with T7 Polymerase. In situ hybridization of 8 dpf

(days postfertilization) larvae required extended 40 min proteinase

K (15 µg/ml) digestions at room temperature, 48 hr hybridization

in the antisense probe, and 2 days of signal development at room

temperature.

Skeletal and Matrix Staining

Bones of larvae and adult fish were stained with alizarin red alone

or in combination with the cartilage stain, alcian blue, as described

by Walker et al.34 and Spoorendonk et al.26 For microscopic anal-

ysis of fibrillar collagen organization, fins and larvae were fixed in

4% paraformaldehyde and embedded in 1% agarose for cryosec-

tioning. Sections were stained with picrosirius red as previously

described35 and were visualized by birefringence under polariza-

tion filters.

Antibody Staining

Myc-tagged BMP1 proteins were detected by whole-mount

immunofluorescent staining with the 9E10 monoclonal antibody

(Santa Cruz, CA, USA). 3 dpf embryos were fixed in 4% paraformaldehyde

overnight at 4°C and were then washed with 0.1% PBS

Triton X-100. Embryos were permeabilized by incubation in

100% acetone for 7 min at −20°C, were rewashed, and were

blocked overnight in PBS/Triton with 0.5% goat serum and 0.1%

dimethyl sulfoxide. After extensive washing, the embryos

were incubated with Alexa488-conjugated secondary antibodies

(Invitrogen) diluted in blocking solution. Finally, embryos were

rewashed before they were cleared in glycerol. Osteoblasts of

the adult fins were stained with the zns5 monoclonal antibody

(obtained from the Zebrafish International Resource Center

[Eugene, OR, USA]) either in the whole mount or after cryosectioning.

They were then counterstained with either alizarin red (for the

whole mounts) or DAPI (for the cryosections) as per previous

methods.36

Retinoic-Acid Treatment

We purchased all-trans retinoic acid (RA) from Sigma (MI, USA),

and we made a 1 mM stock solution by dissolving it in ethanol.

Larvae were treated with 1 µM RA diluted in egg water at 4 dpf.

The controls were exposed to an equivalent amount of just the

carrier (ethanol). Fresh solution was added every other day until

the fish were fixed at 11 dpf for alizarin-red processing.

MicroCT Scanning of Adult Fish

Adult fish were euthanized in Tricaine and scanned immediately

at 40 kV, 130 µA with a Siemens Inveon PET-CT (positron emission

tomography-computed tomography) scanner. The images were

reconstructed and analyzed with Inveon Acquisition Workplace

1.4 and Inveon Research Workplace 3.0, respectively. Bone

mineral density was calculated with the density-phantoms stan-

dards supplied by the company.
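How the vendor software converts image signal to bone mineral density is not described here. One common approach, shown below purely as an assumption, is a linear calibration fitted to phantoms of known density. The function names, the phantom values, and the density units are hypothetical and are not taken from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical phantom readings: scanner signal versus known density.
phantom_signal = [0.0, 500.0, 1000.0]
phantom_density = [0.0, 250.0, 500.0]  # assumed units: mg/cm^3

slope, intercept = fit_line(phantom_signal, phantom_density)


def to_density(signal):
    """Convert a scanner signal to density using the fitted calibration line."""
    return slope * signal + intercept


example_density = to_density(800.0)
```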

Results

A Homozygous BMP1 Mutation Causes High Bone

Mineral Density and Multiple Fractures

We used whole-exome sequencing to identify the causative

mutation underlying an autosomal-recessive form of bone

fragility in a consanguineous family from Turkey. Both


affected individuals (Figures 1A and 1B and Table 1) pre-

sented with multiple fractures after minimal trauma

occurred in their second year of life. Interestingly, despite

recurrent fractures, bone-density measurements showed

values high above the normal range in both individuals

(Figures 1A and 1B). The male index individual had >15

fractures before bisphosphonate treatment was initiated;

this treatment was administered on the basis of the

hypothesis that the individuals had an OI-like disease

with a high rate of fractures, including vertebral fractures,

as a result of impaired bone material with high production

rates and high bone turnover. Osteoclastic activity, as

measured by deoxypyridinoline excretion, was elevated.

These clinical and biochemical findings were not typical

of a classical form of osteopetrosis. After treatment began,

we observed increased bone mass, reduced fractures, and

improved vertebral structures. When the bisphosphonate

treatment was completed, fracture rates again began to

increase. Increased bone mass and a reduced fracture rate

were also observed in his affected sister while she was

Figure 1. Two Siblings with Autosomal-Recessive Bone Fragility and High-Bone-Mass Phenotype

(A and B) Clinical data of both individuals are shown. Above, diagrams illustrate Z scores of bone-mineral-density measurements of the head and vertebrae L2–L4, respectively, indicating highly increased levels of bone mineral density. X-rays show fractured and bent forearms (below) and spinal columns (right) of individuals. High-radiation X-rays of the forearm and spinal column of individual 1 indicate an intense bone density. Vertebrae of individual 2 are flattened and irregularly formed (B).

undergoing bisphosphonate therapy.

High Z scores were seen before and

during therapy.

The exome of the proband was

enriched by the Agilent SureSelect

Human Exome kit and was run on

an Illumina Genome Analyzer IIX.

Over 90% of the exonic sequences

had coverage of at least 20× (Figure 2A),

and the mean coverage was

72×. We took advantage of the

parental consanguinity (Figure 2B),

and we used linkage analysis to deter-

mine larger stretches of homozygosity

in the exome by using identified

variants throughout the exome as

haplotype blocks. In addition to

filtering variants for their location

within the identified stretches of

homozygosity, we considered those

variants that were not annotated in

dbSNP132 or the 1000 Genomes

Database to be possibly causative; this reduced the number

of putative variants to three (Table S1, available online).

The relevant alteration was the homozygous c.34G>C

substitution located in the excellent functional candidate

gene, BMP1. Sanger sequencing confirmed that both

affected individuals were homozygous for the c.34G>C

mutation, whereas both parents were heterozygous. In

addition, it was detected neither in 300 healthy Turkish

control individuals nor in over 2,400 exomes covering

the c.34G>C position (Exome Variant Server, National

Heart, Lung, and Blood Institute Exome Sequencing

Project [ESP], Seattle, WA; Figure 2C).
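The homozygosity filter can be pictured with a deliberately simplified stand-in: a run-length scan over ordered genotype calls, which is not the linkage analysis used in this study. The function names, the `min_len` threshold, and the toy positions below are assumptions made for illustration only.

```python
def homozygous_runs(genotypes, min_len=3):
    """Return (start_pos, end_pos) for runs of at least min_len consecutive homozygous calls.

    genotypes: list of (position, is_homozygous) pairs sorted by position
    along a single chromosome.
    """
    runs = []
    start = last = None
    count = 0
    for pos, is_hom in genotypes:
        if is_hom:
            if count == 0:
                start = pos
            last = pos
            count += 1
        else:
            if count >= min_len:
                runs.append((start, last))
            count = 0
    if count >= min_len:  # close a run that reaches the end of the list
        runs.append((start, last))
    return runs


def in_any_run(pos, runs):
    """True if pos lies inside one of the homozygous intervals."""
    return any(start <= pos <= end for start, end in runs)


# Toy calls for illustration only; not data from the study.
toy = [(100, True), (200, True), (300, True), (400, False), (500, True), (600, True)]
runs = homozygous_runs(toy)
```

A candidate variant would then be retained only if `in_any_run` is true for its position and it is absent from the known-variant databases.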

Functional Effects of the p.Gly12Arg Substitution

in BMP1

The mutation is predicted to substitute arginine for a

conserved glycine residue (p.Gly12Arg) within the signal

peptide of BMP1 (Figure 2D); this signal peptide is essential

for the protein’s localization to the ER, correct posttransla-

tional glycosylation, and secretion.37 Indeed, we found


that in contrast to the Flag-tagged WT BMP1 that is

transiently expressed in HEK 293T cells, the p.Gly12Arg

signal-peptide variant BMP1 showed a drastically reduced

secretion capacity (Figure 3A). Moreover, we detected a

predominant additional lower-molecular-weight band for

mutant BMP1 in immunoblots from cell lysates. Results

from N-glycosidase treatment of WT and mutant-BMP1-

transfected HEK 293T cells indicated that the lower band

in untreated lysates represented a nonglycosylated form

of BMP1 (Figure 3B). Interestingly, deficits in BMP1 glyco-

sylation can also negatively impact secretion.37 These data

thus indicate that p.Gly12Arg BMP1 is inefficiently

secreted and has diminished posttranslational glycosyla-

tion, which might contribute to its impaired secretion.

To assess whether the amino acid substitution in the signal

peptide causes a reduction in extracellular proteolytic

Table 1. Clinical Features of Both Individuals

| Findings | Individual IV:1 | Individual IV:2 |
| --- | --- | --- |
| Age at first visit (years) | 5.0 | 1.9 |
| Age at last visit (years) | 11.4 | 7.5 |
| Age at start of bisphosphonate treatment (years) | 5.4 | 2.8 |
| Age at end of bisphosphonate treatment (years) | 10.4 | 7.5 |
| Birth length and birth weight | normal | normal |
| Confirmed prenatal fractures | none | none |
| Age at first fracture (months) | 23 | 14 |
| Color of sclera | white | white |
| Dentinogenesis imperfecta | no | no |
| Hypermobility of joints | no | no |
| Cardiac impairments | none | none |
| Hearing impairment | no | no |
| Old fractures of extremitiesᵃ | yes | yes |
| Vertebral fracturesᵃ | yes | yes |
| Bowing of upper extremitiesᵃ | no | no |
| Bowing of lower extremitiesᵃ | antecurvation of both tibiae | no |
| Shortening of upper extremitiesᵃ | no | no |
| Shortening of lower extremitiesᵃ | no | no |
| Weight at first visit in kg/BMI (SD) | 23.6 (+1.8) | 10.5 (−0.4) |
| Weight at end of bisphosphonate treatment in kg/BMI (SD) | 44.0 (+1.7) | 22.0 (+0.8) |
| Height at first visit in cm (SD) | 112.0 (+0.1) | 82.3 (−0.7) |
| Height at end of bisphosphonate treatment in cm (SD) | 139.0 (−0.6) | 112.2 (−2.6) |
| Retarded gross motor functions | no | no |
| Mobility at first visit (BAMF score) | 8 | 7 |
| Mobility at last visit (BAMF score) | 9 | 9 |
| Intelligence | normal | normal |
| Calcium levelᵃ (mmol/l) [range] | 2.43 [2.20–2.65] | 2.28 [2.20–2.65] |
| Alkaline phosphatase at first visit (U/l) [range] | 133 [<269] | 107 [<281] |
| Alkaline phosphatase at last visit (U/l) [range] | 116 [<300] | 114 [<300] |
| Procollagen-1-C-peptideᵃ (marker for osteoblastic activity) (µg/l) [range] | 170 [193–716] | 141 [225–676] |
| Deoxypyridinoline/creatinine (marker for osteoclastic activity) at first visit (nM/mM) [mean ± SD] | 58.66 [16.5 ± 5.0] | 63.9 [19.5 ± 7.2] |
| Deoxypyridinoline/creatinine (marker for osteoclastic activity) at end of bisphosphonate treatment (nM/mM) [mean ± SD] | 17.6 [14.2 ± 5.3] | 29.9 [19.5 ± 7.2] |

The following abbreviations are used: SD, standard deviation; BMI, body mass index; and BAMF, brief assessment of motor function.

ᵃ At first presentation.


activity, we analyzed dorsal-ventral patterning of the ze-

brafish embryo—a process which is extremely sensitive

to levels of the Bmp1 target, Chordin—as an in vivo assay

for BMP1 function. Injection of WT BMP1 RNA evoked

a mild ventralization of embryos, whereas the mutant

RNA had quantitatively reduced activity (Figures 3C–3E

and 3I). Exogenous chordin RNA dorsalizes embryos

(Figures 3F and 3I), an effect efficiently reversed by WT

BMP1 RNA but not by the mutant RNA (Figures 3G–3I).

These data demonstrate that the p.Gly12Arg substitution

compromises BMP1 activity in vivo and imply that the

human enzyme can cleave the targets of its zebrafish

counterpart.

Phenotypic Characterization of the Zebrafish frilly fins

Mutant

We analyzed the zebrafish frilly fins (frf−/−) mutant, which

was initially described as causing a phenotype25 character-

ized by a ruffled larval fin (Figures 4A and 4B) as well as a

shortened body axis and malformed craniofacial structures

and fin shape (Figures 4C and 4D). We observed a striking

reduction in ossification of vertebrae from 6 dpf to 11 dpf

(Figures 4E and 4F; data not shown). The osteopenia

persisted in such a manner that at 15 dpf, the anterior

vertebrae had partially ossified but were misshapen

(Figures 4I and 4J). The fact that overall length and

growth of mutant larvae were not reduced at this stage

argues against a general developmental delay (data not

shown). By 25 dpf, all frf mutant vertebrae appeared to

have ossified (Figures S1E and S1F), but fusions could be

seen between some vertebrae (Figures S1G and S1H). Fins

appeared hypomorphic at all stages, and delayed develop-

ment of bony rays (lepidotrichia) appeared at 25 dpf

(Figures S1E and S1F). Adult fins had reduced numbers of

lepidotrichia, which appeared wavy, underwent limited

bifurcation, and were often fused to adjacent rays (Figures

4K and 4L). The presence of calluses is suggestive of spon-

taneous fracturing during fin outgrowth (Figure 4L). We

generated maternal-zygotic mutant embryos from five

alleles; the fact that none were dorsalized indicates that

Bmp1a is dispensable for dorsal-ventral patterning (data

not shown).

Figure 2. Whole-Exome Sequencing and Filtering Identify Mutation in BMP1

(A) Statistical overview of target-base coverage during sequencing process. Over 90% of identified variations were covered more than 20×.

(B) Pedigree structure of the consanguineous Turkish family.

(C) Sequence chromatograms of the identified c.34G>C BMP1 mutation predicted to substitute the glycine at position 12 with arginine. The c.34G>C mutation was found to be heterozygous (middle panel) in both parents and homozygous in both individuals (right panel).

(D) Schematic view of BMP1 domain structure. The locations of identified mutations in humans (above) and zebrafish (below) are shown. Note that the frf tf5 mutation generates multiple splice isoforms. The following abbreviation is used: SP, signal peptide.


A second zebrafish fin mutant, microwaved (med),

displays a phenotype similar to that of frf mutants—de-

layed ossification at 11 dpf (Figures 4G and 4H) and undu-

lation of the larval fin (Figures S1I and S1J).25 As with frf

mutants, all vertebrae of med mutants eventually ossify

(Figures S1A–S1D). To compare the two mutants in more

detail, we quantified bone density in the adults by mi-

croCT analysis (Figures 5A–5D) and assessed both vertebral

and lepidotrichial bone. Both frf and med distal fin rays had

reduced bone density when they were compared to their

respective siblings, but the bone of the vertebrae had diver-

gent phenotypes. Like the lepidotrichial bone, the verte-

brae of med mutants also had reduced bone density;

surprisingly, however, frf-mutant adult vertebrae had

increased bone density (Figure 5E) similar to that seen in

the human individuals with the BMP1 mutation, whereas

med mutants displayed traits similar to those seen in indi-

viduals with classical OI. Meiotic mapping revealed close

linkage between the med locus and zebrafish col1a1a on

linkage group 3 (data not shown). Sequencing of col1a1a

cDNA (GenBank accession number BC063249.1) from

med−/− mutants identified a G>A transition predicted to

substitute a highly conserved glutamic acid with a lysine

at position 888 (p.Glu888Lys; Figures S1K–S1N). Thus,

the microwaved mutant constitutes a zebrafish model of

classical OI.

The Zebrafish frilly fins Mutant is Caused by a bmp1

Mutation

We next mapped the frf mutant to an interval that con-

tains bmp1a on linkage group 8 (Figure S2A). Sequencing

of the bmp1a cDNA (GenBank accession number

BC163535.1) from five frf alleles identified two missense

mutations causing the substitutions p.Ile124Asn and

p.Val223Asp (Figure 2D and Figures S2F and S2H), which

are within the protease domain and which affect con-

served amino acids (Figures S2K and S2L); a nonsense

mutation that truncates the protein at the end of

the proteolytic domain (p.Tyr306*; Figure 2D and Fig-

ure S2G); and two splice-site mutations, one (associated

with frf tm317) leading to the deletion of 21 amino acids

from the proteolytic domain (p.Gln290_Arg310del; Fig-

ure 2D and Figures S2B, S2C, and S2I) and another (associ-

ated with frf tf5) generating four main erroneous splicing

products (p.Tyr378Serfs*6, p.Lys390Serfs*6, p.Gly395ins-

GlyLeuArg*, and p.Lys394_Gly395ins8; Figure 2D and

Figure 3. A Signal-Peptide Substitution in BMP1 Causes Secretion and Glycosylation Defects In Vitro and Loss of Protease Activity In Vivo

(A) Immunoblot of HEK 293T cells transfected with either Flag-tagged WT BMP1 (BMP1-Flag; lanes 1 and 4) or p.Gly12Arg-substituted BMP1 (BMP1mut-Flag; lanes 2 and 5) and untransfected control cells (control; lanes 3 and 6). Immunoblotting shows that the p.Gly12Arg protein (lanes 1–3) isolated from cell lysates had increased mobility; reduced amounts of this BMP1 protein were secreted into the medium (lanes 4–6).

(B) Immunoblot of lysates of HEK 293T cells transfected with either Flag-tagged WT BMP1 (BMP1-Flag; lanes 1 and 2) or p.Gly12Arg-substituted BMP1 (BMP1mut-Flag; lanes 3 and 4) and untransfected controls (control; lanes 5 and 6). After being harvested, lysates were either treated with N-glycosidase (lanes 2, 4, and 6) or left untreated (lanes 1, 3, and 5). The predominant mutant-BMP1 band with increased mobility migrates at the same rate as deglycosylated WT BMP1.

(C–I) The p.Gly12Arg-substituted BMP1 exhibits reduced Chordinase activity in vivo. Lateral views of uninjected 24 hpf zebrafish embryos (C) and embryos injected with RNA encoding either WT BMP1 (BMP1; D and G) or p.Gly12Arg signal-peptide variant BMP1 (BMP1mut; E and H). Chordinase activity was assessed by its ability to ventralize WT embryos (C–E) or rescue dorsalized (chordin RNA injected) embryos (F–H). In both assays, the mutant BMP1 showed reduced ability to counteract either the endogenous or exogenous Chordin (quantified in I).


Figures S2D, S2E, and S2J). We were able to rescue the frf

larval fin phenotype by injecting a DNA construct driving

zebrafish Bmp1a expression from a heat-shock promoter,

further supporting the conclusion that frilly fins represents

a bmp1a mutant (Figures S3A–S3C).

Bmp1 Function in Bone Formation and Development

To understand the function of Bmp1 in bone formation,

we characterized zebrafish bmp1a gene expression and

the frf mutant phenotype in more detail. Zebrafish

bmp1a is expressed in osteoblasts. During larval stages,

we observed strong expression in fin mesenchyme cells

within the fin fold at a stage contemporaneous with the

appearance of the fin-fold defect in frf mutants (Figures

6A and 6B). In addition, we noted expression in the floor

plate and hypochord, branchial arches and operculum

(Figures 6A and 6C), and sites of bone formation in the

head. We confirmed expression of bmp1a in osteoblasts

by double-fluorescent in situ hybridization with the zebrafish

osteoblast marker col10a1. We found consistent overlap

of both mRNAs in osteoblasts on the operculum and

cleithrum (Figures 6E and 6E″; data not shown). We also

noted expression in clusters of cells arranged metameri-

cally adjacent to the notochord, a location and arrange-

ment consistent with osteoblasts of the vertebral column

(Figure 6D).

To determine whether the compromised ossification in

frilly fins mutant larvae is due to insufficient generation

of osteoblasts, we analyzed expression of sp7, osteopontin,

and collagen10a1 by in situ hybridization in frf mutants.

Osteoblasts were found to be normal in number and

location, consistent with the interpretation that loss of

bmp1a does not disrupt osteoblast generation, number,

localization, or differentiation (Figures 6F–6K). We con-

firmed the presence of normal osteoblast numbers by first

crossing frf into the sp7:mcherry transgenic line,26 which

labels osteoblasts on skeletal structures. Compared to WT

cells, mCherry-positive cells did not decrease in number

in the frfmutant vertebrae or fins, even at the earliest times

that such cells are visible (Figures 6L and 6M and Figures S4A

and S4B). We obtained a similar result by immunostaining

with the osteoblast-specific zns5 antibody, although we did

observe altered cellular morphology—osteoblasts appeared

more cuboidal in the mutant than did the flattened cells in

the WT bone (Figures 7A–7D). The latter effect was also

seen in the lepidotrichia when imaged by transmission

electron microscopy (Figures 7E and 7F). However, no

significant change in osteoblast number was observed

in fin rays (Figure S4C). To test whether this altered-

morphology phenotype is concomitant with a loss of oste-

oblast activity, we exposed frf mutants to RA, which was

previously described as stimulating osteoblast activity

and causing precocious hyperossification of the entire

vertebral column.26,33 Unlike in WT (Figures 7G and 7H),

ossification in frf mutants was almost completely refrac-

tory to RA treatment (Figures 7I and 7J), consistent with

a defect in the ability of osteoblasts to effectively generate

osteoid, downstream of osteoblast differentiation and

activity.

Bmp1 could potentially affect the ossification process

through proteolytic cleavage processes involving a number

of targets. Among these, its ability to cleave and inactivate

the BMP2/4 inhibitor Chordin is well documented and has

the potential to affect the ossification process.38,39 To test

whether the frf phenotypes could be due to an excess of

uncleaved Chordin, we generated frf−/− chordin−/−

double-mutant adults (in which the early ventralized

phenotype was rescued by chordin mRNA injection). As

previously described,40 Chordin function is dispensable

after gastrulation for axial skeleton generation and

patterning (Figures S5A and S5C), yet it remains possible

that elevated levels might perturb osteogenesis. However,

Chordin loss at late larval stages failed to rescue the frf

Figure 4. The Zebrafish frilly fins Mutant Displays Larval Fin-Fold Ruffling and Osteogenesis Defects

(A and B) Ventral view of the posterior medial fin fold at 3 dpf in a frilly fins (frf) (B) larva showing undulations in the fin fold, which normally has a linear morphology (A).

(C and D) Compared to the siblings, 4-month-old frf−/− adults (D) are short and display axis defects, body curvature, and fin and craniofacial dysmorphogenesis.

(E–L) Alizarin-red staining of frf−/− (F, J, and L), microwaved (med−/−) (H), and WT siblings (E, G, I, and K) at 11 dpf (E–H), 15 dpf (I and J), and 4 months (K and L). Both frf and med display reduced ossification of the vertebrae (F and H), whereas nascent vertebrae are osteopenic and dysmorphic (J). Tail fins in frf−/− have lost the WT fan shape (K) and display fracture calluses, reduced bifurcations (L), and crinkled lepidotrichia, which often fuse to each other (L; inset).


defect (Figures S5B and S5D). In addition to Chordin

cleavage, Bmp1 also plays a role in generating mature

Collagen I through the cleavage of the C-terminal propep-

tide domain (Figure 8A). Indeed, the inability of RA to

rescue the ossification process in frf mutants and the simi-

larity between the frf and med vertebrae phenotypes

suggest that the major requirement for Bmp1 in bone

formation is the generation of mature Collagen I. Using

Figure 5. CT Analysis of frilly fins and microwaved Bone Reveals Altered Adult Bone Densities
(A–D) microCT analysis of bone density in a 7-month-old frf−/− mutant (B) and its WT sibling (A) as well as a med−/− mutant (D) and its WT sibling (C). Note that although the med mutant looks overtly normal (D), the frf mutant skeleton displays axial curvature and defects in the head skeleton (B).
(E) Box plots of density measurements derived from microCT analysis of mutants and sibling vertebrae and fin lepidotrichia. med mutants have osteopenia of both lepidotrichia in the fins and vertebrae (blue boxes). Although frf mutants also display osteopenia of the fins, they show an increased bone density in the vertebrae (green boxes). The Mann-Whitney U test was performed for comparing densities between mutants and siblings for each bone type (*** denotes p < 0.001; ** denotes p < 0.01; and n = 8 for all data sets). Boxes indicate the median and the 25th and 75th percentiles, whereas the whiskers display the largest and smallest values.

Figure 6. bmp1a Is Expressed in Osteoblasts, which Appear Normally Differentiated in frilly fins
(A–D) In situ hybridization of bmp1a at 4 dpf (A), 2 dpf (B), 3 dpf (C) and 8 dpf (D). Expression is seen in fin mesenchyme cells of the fin fold (A and B), floor plate and hypochord (A, open and filled arrowheads, respectively), branchial arches (C, asterisk), and operculum (C, arrowhead) and perichordal cells of the anterior notochord (D, arrowheads).
(E–E″) Confocal images of double-fluorescent in situ hybridizations showing coexpression of bmp1a (E and E″; red) and the osteoblast marker collagen10a1 (E′ and E″; green) in osteoblasts on the operculum at 4 dpf. Most cells express both markers (three cases highlighted by arrowheads), and central, more mature osteoblasts express slightly higher levels of col10a1.
(F–K) Perichordal expression of osteoblast markers is not disrupted in frf mutants. Lateral images of anterior notochord of 8 dpf WT (F, H, and J) and frf−/− (G, I, and K) larvae hybridized with probes for sp7 (F and G), osteopontin (H and I), and collagen10a1 (J and K).
(L and M) Confocal images of sp7:mCherry-expressing osteoblasts on vertebrae of WT (L) and frf mutant (M) larvae at 20 dpf. There is no reduction in osteoblast numbers in the mutant.


the picrosirius-red staining method to enhance birefrin-

gency of highly ordered fibrillar collagen,35 we analyzed

collagen fibrillogenesis. At all stages analyzed and in both

fins and vertebrae, we found a significant reduction in

birefringence in frf−/− mutants; this indicates a loss of

fibrillar-collagen structure (Figures 8B–8G). Ultrastructural

imaging of fibrillar collagen in the fins at 6 dpf by transmis-

sion electron microscopy revealed disruption to the

normal periodic collagen fibril (Figure 8H) in the mutant

(Figure 8I). Immunoblot analysis, employing an antibody

raised against zebrafish Col1a1a, demonstrated compromised

C-propeptide removal in frf−/− mutants at both

larval and adult stages (Figures 8J and 8K).

Because the frf larval fin phenotype is most likely due to

loss of Bmp1-mediated collagen processing, we used a test

for such processing as an in vivo assay to assess the activity

of BMP1 with the p.Gly12Arg substitution. Toward this

end, we injected into frf embryos DNA constructs contain-

ing a heat-shock promoter upstream of either human WT

BMP1 coding sequences or a mutant version bearing the

p.Gly12Arg substitution. This assay demonstrated that

BMP1 with the p.Gly12Arg substitution had a significantly

reduced ability to rescue the fin defect (Figures S3A–S3E);

this finding suggests that p.Gly12Arg causes a reduced

ability to augment in vivo the deficit in frf C-propeptidase

activity.

Discussion

In this study, we identify a homozygous missense muta-

tion substituting an amino acid in the signal peptide of

BMP1 in a Turkish consanguineous family with auto-

somal-recessive high-bone-density OI. BMP1 is known to

have several functions in different pathways, including

the proteolytic processing of the procollagen I C-propep-

tide for the generation of mature collagen type I. On the

basis of this information and given the fact that the

majority of OI cases are associated with defects in COL1A

genes or in genes involved in collagen I biosynthesis,3,5

we propose BMP1 as a highly relevant gene in this

OI-disease context.

The identified substitution, p.Gly12Arg, is located in

exon 1, which encodes the signal peptide of BMP1.

Although it remains to be determined whether the sig-

nal-peptide variant results in a reduction of BMP1 enzy-

matic activity per se, bioinformatic analysis predicted a

disruption of the signal peptide and therefore suggested a

failure of intracellular protein sorting of premature BMP1

and an exclusion from the secretory pathway. Subse-

quently, we confirmed in in vitro assays that the amino

acid substitution leads to severely reduced post-transla-

tional N-glycosylation of the mutant protein and impaired

protein secretion (Figures 3A and 3B). Our results agree

Figure 7. Osteoblasts in frilly fins Mutants Have Altered Morphology but Cannot Hyperossify upon RA Treatment
(A–D) Immunohistochemical (A and B) and immunofluorescent (C and D) staining of osteoblasts with the zns5 antibody (brown stain in A and B; green stain in C and D) in WT (A and C) and frf−/− (B and D) fins at 120 dpf. (A) and (B) display lateral views of fin rays counterstained with alizarin red (bone is in red), whereas (C) and (D) are transverse sections of fin rays counterstained with DAPI (blue). There is no loss in number of zns5+ osteoblasts in frf−/− mutants. However, the osteoblasts display an altered morphology when viewed in cross section; they appear more cuboidal where they are normally flat cells that maintain intimate contact with the bone surface (arrowheads in C and D).
(E and F) Electron micrographs of transverse sections of adult fin rays. Osteoblasts are indicated with red arrowheads and appear flat in WT fins (E) yet more cuboidal in frf−/− fins (F).
(G–J) Alizarin-red staining of 11 dpf WT (G and H) and frf−/− larvae (I and J) treated with (H and J) or without (G and I) RA for enhancing osteoblast activity. Despite normal numbers of differentiated osteoblasts in frf mutants, these cells are unable to mineralize the notochord efficiently upon RA stimulation; this suggests a defect downstream of osteoblast differentiation.


with those published in a previous study (Garrigue-Antar

et al.37), which showed the importance of N-glycosylation

for secretion and stability of BMP1. Compromised BMP1

secretion thus reduces the availability of BMP1 in the

extracellular matrix, potentially leading to the insufficient

processing of substrates, including the procollagen I C-pro-

peptide. Indeed, we were able to demonstrate measurably

reduced processing of two substrates by p.Gly12Arg-

substituted BMP1 in two in vivo assays in zebrafish. Exog-

enous Bmp1 has been described as cleaving the dorsal

determinant, Chordin, and leading to ventralization of

zebrafish embryos.41 We exploited this finding to show

that the p.Gly12Arg-substituted version of BMP1 was less

efficient at ventralizing the embryo than the WT BMP1

(Figure 3). In addition, unlike the WT BMP1, the

p.Gly12Arg-substituted BMP1 was unable to measurably

rescue the larval fin ruffling of the bmp1a zebrafish mutant

(Figure S3); larval fin ruffling is associated with defective

collagen-rod formation. Thus, we have shown that

p.Gly12Arg leads to both reduced secretion and subse-

quent reduced processing of the substrates Chordin and

Collagen I.

Such reduced C-propeptide cleavage predicts the

assembly of procollagen instead of mature collagen into

collagen I fibrils. Immunoblot analysis of both larval and

adult zebrafish protein samples demonstrated that forms

retaining the C-propeptide predominate upon reduction

of Bmp1 function (Figure 8). Unfortunately, no material

was available to show this effect in the affected individuals.

We hypothesize that this results in an impairment of the

collagen matrix within the bone structure and could be

the major cause in the underlying pathomechanism.

Supporting this hypothesis, use of the birefringent

collagen stain, picrosirius red, demonstrated less ordered

collagen-fiber structure in zebrafish bmp1a mutants (Fig-

ure 8). Interestingly, recent studies have assumed that

collagen C-propeptides assembled within collagen fibrils

would either increase the intrafibrillar spacing or directly

serve as nucleators of mineralization.15 This might explain

the high bone mineral density we noted in our individuals

(Figure 1 and Table 1). Our findings provide evidence that

insufficient collagen processing caused by p.Gly12Arg

most likely leads to ectopic accumulation of minerals in

the bone. However, this bone is nonetheless structurally

compromised, leading to fragility. The precise pathophysi-

ology of increased bone mineralization upon defective

collagen processing remains unclear and needs to be inves-

tigated further.

We note similar defects in bone formation in the zebra-

fish frilly fins mutant, and we show that these defects

Figure 8. frf Displays Defects in Fibrillar Collagen Order and Col1a1a Processing
(A) Major proteolytic roles of Bmp1 include removing the C-propeptide (orange ovals) of pro-Collagen I and cleaving the BMP2/4 inhibitor, Chordin (red hexagon), to release free BMP2/4 (green oval).
(B–G) Picrosirius-red stained sagittal (B–E) and transverse (F and G) sections of WT (B, D, and F) and frf−/− (C, E, and G) larvae at 11 dpf (B and C), 20 dpf (D and E), and 4 months (F and G). Sections viewed under polarized light reveal the reduced collagen-fiber-associated birefringency in the centra region (B–E) and fin rays (F and G) in frf−/− mutant (C, E, and G) and WT (B, D, and F) larvae.
(H and I) Transmission electron micrographs of longitudinal sections of WT (H) and frf−/− (I) larval medial fins at 6 dpf show loss of structured collagen fibers in the mutant.
(J and K) Immunoblots of protein extracted from WT (lane 1) and frf−/− mutant (lane 2) larvae probed with an antibody directed against zebrafish Collagen1a1a (upper panels in both J and K) or an antibody against β-actin as a loading control (lower panels). The four possible Collagen1a1 forms are indicated on the right; these forms include Procollagen1a1 retaining both C- and N-terminal propeptides (Pro α1[I]), mature collagen α1(I) retaining neither propeptide (α1[I]), a form retaining only the N-propeptide (pN α1[I]), and a form retaining only the C-propeptide (pC α1[I]). In 6 dpf (J) and 4-month-old (K) frf−/− mutants, the two forms retaining the C-propeptide predominate.


correspond to mutations in bmp1a. Alizarin-red staining

demonstrated delayed ossification at larval stages and mal-

formation of adult skeletal structures with evidence of

fractures (Figure 4). Quantification of mineral content by

microCT analysis demonstrated a higher mineral content

of mature bone, as seen in the individuals. The reason for

the divergent mineral-content phenotypes between the

larval-stage frilly fins mutant and the adult-stage mutant

is currently unclear. We hypothesize that there is a differ-

ence in mineralization rate between the two stages and

that mature collagen is rate limiting during the initial

deposition in larval stages, whereas in adult stages, the

retained telopeptide gradually induces increased minerali-

zation through a mechanism that has not been deter-

mined. Accordingly, the only location found to have

reduced mineralization in adult frilly fins is the distal fin,

a site of bone deposition.
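
The density comparisons summarized in Figure 5 rely on the Mann-Whitney U test with n = 8 per group. As a minimal illustration of that test, and not a reproduction of the authors' analysis, the following Python sketch computes the U statistic and an exact two-sided p value by enumerating rank assignments. The density values are invented placeholders, the function names are hypothetical, and the enumeration assumes that no values are tied.

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """U statistic for x: pairs (a, b) with a > b, ties counted as 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def exact_two_sided_p(x, y):
    """Exact two-sided p value by enumerating rank assignments (no ties assumed)."""
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    observed = abs(mann_whitney_u(x, y) - mean_u)
    extreme = total = 0
    # Every choice of n1 ranks out of n1 + n2 is equally likely under the null.
    for ranks in combinations(range(1, n1 + n2 + 1), n1):
        u = sum(ranks) - n1 * (n1 + 1) / 2  # U from the rank sum of group x
        total += 1
        if abs(u - mean_u) >= observed:
            extreme += 1
    return extreme / total


# Invented vertebral densities (arbitrary units), eight fish per group.
mutant = [1.31, 1.28, 1.35, 1.40, 1.33, 1.29, 1.37, 1.42]
sibling = [1.10, 1.15, 1.08, 1.12, 1.20, 1.18, 1.05, 1.14]

u_stat = mann_whitney_u(mutant, sibling)
p_value = exact_two_sided_p(mutant, sibling)
print(f"U = {u_stat}, exact two-sided p = {p_value:.6f}")
```

With two fully separated groups of eight, as in these placeholder data, only 2 of the 12,870 possible rank assignments are as extreme as the observed one, so the exact p value falls below the 0.001 threshold marked by three asterisks in Figure 5.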

As the first described animal model with reduced

Bmp1 function in an adult, frilly fins was used for the

investigation of the role of this protease in bone forma-

tion. We showed that although generation and patterning

of osteoblasts is unaffected (Figure 6), there was an

intriguing alteration in morphology of the osteoblasts—

they adopted a cuboidal shape (Figure 7). Although this

might indicate a defect in the osteoblasts themselves, we

favor the interpretation that this is a result of reduced

adhesion to the compromised bone matrix. Supporting

this is the in vitro observation that osteoblasts cultured

on bone matrix lacking collagen appeared rounded com-

pared to flattened cells cultured on a purified mineralized

collagen matrix.42

The role of Bmp1 (and other Tolloid-related proteins) in

dorsal-ventral patterning through Chordin cleavage is

well documented. We could, however, conclusively

show that this was not the relevant substrate underlying

the frilly fins bone phenotype (Figure S5). In fact, multiple

lines of evidence from our analysis support the interpreta-

tion that the major role of Bmp1 in ossification is

removal of the C-propeptide from Collagen I. First, we

note similarity of frf to the microwaved mutant, which

we identified as a collagen1a1a mutant (Figure 4). Second,

frf larvae are not able to hyperossify the vertebral column

upon retinoic-acid stimulation of the osteoblasts (Fig-

ure 7), most consistent with a defect that is considerably

downstream in the process of osteoid formation. Finally,

we show compromised Collagen-I processing and

higher-order structure in frf biochemically and histologi-

cally (Figure 8).

While we were preparing this manuscript, Martinez-Glez

et al.43 described a homozygous missense mutation

causing an alteration in the protease domain of BMP1 in

two affected individuals from an Egyptian family affected

by severe autosomal-recessive OI. In line with our findings,

concomitant abnormal procollagen I C-propeptide pro-

cessing was described. Comparison of the individuals’

phenotypes in both studies showed the following differ-

ences: affected individuals in the Egyptian family pre-

sented with classical autosomal-recessive OI, whereas our

individuals, as well as the zebrafish model, presented

with bone fragility associated with an increase in bone

mineral density. Interestingly, Lindahl et al.15 very

recently described COL1A1 mutations affecting the BMP1

C-propeptide cleavage site; also, the individuals in this

study presented with an increased-mineralization OI

phenotype very similar to our individuals, suggesting

that impaired BMP1-related collagen C-propeptide cleav-

age (either by mutations in BMP1 or mutations affecting

the BMP1 cleavage site in COL1A1) causes a distinct

form of OI. In contrast, the missense mutation described

by Martinez-Glez et al.43 might have different functional

consequences and thereby cause phenotypic variability

and differences in the severity of the disease.

Taken together, our combined data in humans and

zebrafish define the molecular and cellular bases of

BMP1-dependent osteogenesis and show the importance

of this protein for bone formation and stability. These

data, in both humans and zebrafish, support the finding

that deficits in removal of the C-propeptide from Collagen

I result in autosomal-recessive OI with high bone mineral

density.15

Supplemental Data

Supplemental Data include five figures and one table and can be

found with this article online at http://www.cell.com/AJHG.

Acknowledgments

We are grateful to all family members that participated in this

study, Esther Milz for excellent technical assistance, Karin Boss

for critically reading the manuscript, and Kaicheng Liang from

the Singapore Bioimaging Consortium for microCT imaging.

This work was supported by the German Federal Ministry of

Education and Research by grant 01GM0880 (SKELNET) to B.W.

The authors would like to thank the National Heart, Lung, and

Blood Institute Grand Opportunity (GO) Exome Sequencing

Project and the following ongoing studies that produced and

provided exome-variant calls for comparison: the Lung GO

Sequencing Project (HL-102923), the Women’s Health Initiative

Sequencing Project (HL-102924), the Broad GO Sequencing

Project (HL-102925), the Seattle GO Sequencing Project (HL-

102926), and the Heart GO Sequencing Project (HL-103010).

Received: November 17, 2011

Revised: January 23, 2012

Accepted: February 24, 2012

Published online: April 5, 2012

Web Resources

The URLs for data presented herein are as follows:

ENSEMBL, http://www.ensembl.org

Exome Variant Server, http://snp.gs.washington.edu/EVS/

OMIM, http://www.ncbi.nlm.nih.gov/omim

PolyPhen, http://coot.embl.de/PolyPhen

UCSC Genome Browser, http://www.genome.ucsc.edu


References

1. Byers, P.H., and Cole, W.G. (2002). Osteogenesis Imperfecta. In

Connective Tissue and its Heritable Disorders: Molecular,

Genetic, and Medical Aspects, Second Edition, P. Royce and B.

Steinmann, eds. (Hoboken, NJ: John Wiley & Sons), pp. 385–430.

2. Sillence, D.O., and Rimoin, D.L. (1978). Classification of

osteogenesis imperfecta. Lancet 1, 1041–1042.

3. Basel, D., and Steiner, R.D. (2009). Osteogenesis imperfecta:

Recent findings shed new light on this once well-understood

condition. Genet. Med. 11, 375–385.

4. Rauch, F., and Glorieux, F.H. (2004). Osteogenesis imperfecta.

Lancet 363, 1377–1385.

5. Marini, J.C., Forlino, A., Cabral, W.A., Barnes, A.M., San Antonio,

J.D., Milgrom, S., Hyland, J.C., Korkko, J., Prockop, D.J.,

De Paepe, A., et al. (2007). Consortium for osteogenesis imperfecta

mutations in the helical domain of type I collagen: Regions

rich in lethal mutations align with collagen binding sites for

integrins and proteoglycans. Hum. Mutat. 28, 209–221.

6. Pollitt, R., McMahon, R., Nunn, J., Bamford, R., Afifi, A.,

Bishop, N., and Dalton, A. (2006). Mutation analysis of

COL1A1 and COL1A2 in patients diagnosed with osteogenesis

imperfecta type I-IV. Hum. Mutat. 27, 716.

7. Morello, R., Bertin, T.K., Chen, Y., Hicks, J., Tonachini, L.,

Monticone, M., Castagnola, P., Rauch, F., Glorieux, F.H.,

Vranka, J., et al. (2006). CRTAP is required for prolyl 3-hydroxylation

and mutations cause recessive osteogenesis imperfecta.

Cell 127, 291–304.

8. Cabral, W.A., Chang, W., Barnes, A.M., Weis, M., Scott, M.A.,

Leikin, S., Makareeva, E., Kuznetsova, N.V., Rosenbaum,

K.N., Tifft, C.J., et al. (2007). Prolyl 3-hydroxylase 1 deficiency

causes a recessive metabolic bone disorder resembling lethal/

severe osteogenesis imperfecta. Nat. Genet. 39, 359–365.

9. Christiansen, H.E., Schwarze, U., Pyott, S.M., AlSwaid, A., Al

Balwi, M., Alrasheed, S., Pepin, M.G., Weis, M.A., Eyre, D.R.,

and Byers, P.H. (2010). Homozygosity for a missense mutation

in SERPINH1, which encodes the collagen chaperone protein

HSP47, results in severe recessive osteogenesis imperfecta. Am.

J. Hum. Genet. 86, 389–398.

10. van Dijk, F.S., Nesbitt, I.M., Zwikstra, E.H., Nikkels, P.G.,

Piersma, S.R., Fratantoni, S.A., Jimenez, C.R., Huizer, M.,

Morsman, A.C., Cobben, J.M., et al. (2009). PPIB mutations

cause severe osteogenesis imperfecta. Am. J. Hum. Genet.

85, 521–527.

11. Lapunzina, P., Aglan, M., Temtamy, S., Caparros-Martín, J.A.,

Valencia, M., Leton, R., Martínez-Glez, V., Elhossini, R., Amr,

K., Vilaboa, N., and Ruiz-Perez, V.L. (2010). Identification of

a frameshift mutation in Osterix in a patient with recessive

osteogenesis imperfecta. Am. J. Hum. Genet. 87, 110–114.

12. Becker, J., Semler, O., Gilissen, C., Li, Y., Bolz, H.J., Giunta, C.,

Bergmann, C., Rohrbach, M., Koerber, F., Zimmermann, K.,

et al. (2011). Exome sequencing identifies truncating muta-

tions in human SERPINF1 in autosomal-recessive osteogenesis

imperfecta. Am. J. Hum. Genet. 88, 362–371.

13. Alanay, Y., Avaygan, H., Camacho, N., Utine, G.E., Boduroglu,

K., Aktas, D., Alikasifoglu, M., Tuncbilek, E., Orhan, D., Bakar,

F.T., et al. (2010). Mutations in the gene encoding the RER

protein FKBP65 cause autosomal-recessive osteogenesis im-

perfecta. Am. J. Hum. Genet. 86, 551–559.

14. Mann, V., and Ralston, S.H. (2003). Meta-analysis of COL1A1

Sp1 polymorphism in relation to bone mineral density and

osteoporotic fracture. Bone 32, 711–717.

15. Lindahl, K., Barnes, A.M., Fratzl-Zelman, N., Whyte, M.P.,

Hefferan, T.E., Makareeva, E., Brusel, M., Yaszemski, M.J., Ru-

bin, C.J., Kindmark, A., et al. (2011). COL1 C-propeptide

cleavage site mutations cause high bone mass osteogenesis

imperfecta. Hum. Mutat. 32, 598–609.

16. Myllyharju, J., and Kivirikko, K.I. (2004). Collagens, modi-

fying enzymes and their mutations in humans, flies and

worms. Trends Genet. 20, 33–43.

17. Bond, J.S., and Beynon, R.J. (1995). The astacin family of met-

alloendopeptidases. Protein Sci. 4, 1247–1261.

18. Sterchi, E.E., Stocker, W., and Bond, J.S. (2008). Meprins,

membrane-bound and secreted astacin metalloproteinases.

Mol. Aspects Med. 29, 309–328.

19. Ge, G., and Greenspan, D.S. (2006). Developmental roles of

the BMP1/TLD metalloproteinases. Birth Defects Res. C

Embryo Today 78, 47–68.

20. Ge, G., and Greenspan, D.S. (2006). BMP1 controls TGFbeta1

activation via cleavage of latent TGFbeta-binding protein.

J. Cell Biol. 175, 111–120.

21. Kessler, E., Takahara, K., Biniaminov, L., Brusel, M., and

Greenspan, D.S. (1996). Bone morphogenetic protein-1: The

type I procollagen C-proteinase. Science 271, 360–362.

22. Canty, E.G., and Kadler, K.E. (2005). Procollagen trafficking,

processing and fibrillogenesis. J. Cell Sci. 118, 1341–1353.

23. Suzuki, N., Labosky, P.A., Furuta, Y., Hargett, L., Dunn, R.,

Fogo, A.B., Takahara, K., Peters, D.M., Greenspan, D.S., and

Hogan, B.L. (1996). Failure of ventral body wall closure in

mouse embryos lacking a procollagen C-proteinase encoded

by Bmp1, a mammalian gene related to Drosophila tolloid.

Development 122, 3587–3595.

24. Parichy, D.M., Elizondo, M.R., Mills, M.G., Gordon, T.N., and

Engeszer, R.E. (2009). Normal table of postembryonic zebra-

fish development: Staging by externally visible anatomy of

the living fish. Dev. Dyn. 238, 2975–3015.

25. van Eeden, F.J., Granato, M., Schach, U., Brand, M., Furutani-

Seiki, M., Haffter, P., Hammerschmidt, M., Heisenberg, C.P.,

Jiang, Y.J., Kane, D.A., et al. (1996). Genetic analysis of fin formation

in the zebrafish, Danio rerio. Development 123, 255–262.

26. Spoorendonk, K.M., Peterson-Maduro, J., Renn, J., Trowe, T.,

Kranenbarg, S., Winkler, C., and Schulte-Merker, S. (2008).

Retinoic acid and Cyp26b1 are critical regulators of osteogen-

esis in the axial skeleton. Development 135, 3765–3774.

27. Geisler, R. (2002). Mapping and cloning. In Zebrafish: A

practical approach, C. Nusslein-Volhard and R. Dahm, eds.

(Oxford: Oxford University Press), pp. 175–212.

28. Kwan, K.M., Fujimoto, E., Grabher, C., Mangum, B.D., Hardy,

M.E., Campbell, D.S., Parant, J.M., Yost, H.J., Kanki, J.P., and

Chien, C.B. (2007). The Tol2kit: A multisite gateway-based

construction kit for Tol2 transposon transgenesis constructs.

Dev. Dyn. 236, 3088–3099.

29. Rentzsch, F., Zhang, J., Kramer, C., Sebald, W., and Ham-

merschmidt, M. (2006). Crossveinless 2 is an essential positive

feedback regulator of Bmp signaling during zebrafish gastrula-

tion. Development 133, 801–811.

30. Balciunas, D., Wangensteen, K.J., Wilber, A., Bell, J., Geurts, A.,

Sivasubbu, S., Wang, X., Hackett, P.B., Largaespada, D.A.,

McIvor, R.S., and Ekker, S.C. (2006). Harnessing a high

cargo-capacity transposon for genetic applications in verte-

brates. PLoS Genet. 2, e169.

31. Thisse, C., and Thisse, B. (2008). High-resolution in situ

hybridization to whole-mount zebrafish embryos. Nat. Pro-

toc. 3, 59–69.


32. Brend, T., and Holley, S.A. (2009). Zebrafish whole mount

high-resolution double fluorescent in situ hybridization. J.

Vis. Exp. 25, 1229.

33. Laue, K., Janicke, M., Plaster, N., Sonntag, C., and Ham-

merschmidt, M. (2008). Restriction of retinoic acid activity

by Cyp26b1 is required for proper timing and patterning of

osteogenesis during zebrafish development. Development

135, 3775–3787.

34. Walker, M.B., and Kimmel, C.B. (2007). A two-color acid-free

cartilage and bone stain for zebrafish larvae. Biotech. Histo-

chem. 82, 23–28.

35. Borges, L.F., Gutierrez, P.S., Marana, H.R., and Taboga, S.R.

(2007). Picrosirius-polarization staining method as an effi-

cient histopathological tool for collagenolysis detection in

vesical prolapse lesions. Micron 38, 580–583.

36. Brown, A.M., Fisher, S., and Iovine, M.K. (2009). Osteoblast

maturation occurs in overlapping proximal-distal compart-

ments during fin regeneration in zebrafish. Dev. Dyn. 238,

2922–2928.

37. Garrigue-Antar, L., Hartigan, N., and Kadler, K.E. (2002). Post-

translational modification of bone morphogenetic protein-1

is required for secretion and stability of the protein. J. Biol.

Chem. 277, 43327–43334.

38. Scott, I.C., Blitz, I.L., Pappano, W.N., Imamura, Y., Clark, T.G.,

Steiglitz, B.M., Thomas, C.L., Maas, S.A., Takahara, K., Cho,

K.W., and Greenspan, D.S. (1999). Mammalian BMP-1/

Tolloid-related metalloproteinases, including novel family

member mammalian Tolloid-like 2, have differential enzy-

matic activities and distributions of expression relevant to

patterning and skeletogenesis. Dev. Biol. 213, 283–300.

39. Lee, K.S., Kim, H.J., Li, Q.L., Chi, X.Z., Ueta, C., Komori, T.,

Wozney, J.M., Kim, E.G., Choi, J.Y., Ryoo, H.M., and Bae,

S.C. (2000). Runx2 is a common target of transforming

growth factor beta1 and bone morphogenetic protein 2, and

cooperation between Runx2 and Smad5 induces osteoblast-

specific gene expression in the pluripotent mesenchymal

precursor cell line C2C12. Mol. Cell. Biol. 20, 8783–8792.

40. Fisher, S., and Halpern, M.E. (1999). Patterning the zebrafish

axial skeleton requires early chordin function. Nat. Genet.

23, 442–446.

41. Muraoka, O., Shimizu, T., Yabe, T., Nojima, H., Bae, Y.K.,

Hashimoto, H., and Hibi, M. (2006). Sizzled controls dorso-

ventral polarity by repressing cleavage of the Chordin protein.

Nat. Cell Biol. 8, 329–338.

42. Basle, M.F., Grizon, F., Pascaretti, C., Lesourd, M., and Chap-

pard, D. (1998). Shape and orientation of osteoblast-like cells

(Saos-2) are influenced by collagen fibers in xenogenic bone

biomaterial. J. Biomed. Mater. Res. 40, 350–357.

43. Martínez-Glez, V., Valencia, M., Caparros-Martín, J.A., Aglan,

M., Temtamy, S., Tenorio, J., Pulido, V., Lindert, U., Rohrbach,

M., Eyre, D., et al. (2012). Identification of a mutation

causing deficient BMP1/mTLD proteolytic activity in auto-

somal recessive osteogenesis imperfecta. Hum. Mutat. 33,

343–350.


ARTICLE

A ‘‘Copernican’’ Reassessment of the Human Mitochondrial DNA Tree from its Root

Doron M. Behar,1,2,* Mannis van Oven,3,* Saharon Rosset,4 Mait Metspalu,1 Eva-Liis Loogvali,1

Nuno M. Silva,5 Toomas Kivisild,1,6 Antonio Torroni,7 and Richard Villems1,8

Mutational events along the human mtDNA phylogeny are traditionally identified relative to the revised Cambridge Reference

Sequence, a contemporary European sequence published in 1981. This historical choice is a continuous source of inconsistencies,

misinterpretations, and errors in medical, forensic, and population genetic studies. Here, after having refined the human mtDNA

phylogeny to an unprecedented level by adding information from 8,216 modern mitogenomes, we propose switching the reference

to a Reconstructed Sapiens Reference Sequence, which was identified by considering all available mitogenomes from Homo neandertha-

lensis. This ‘‘Copernican’’ reassessment of the human mtDNA tree from its deepest root should resolve previous problems and will

have a substantial practical and educational influence on the scientific and public perception of human evolution by clarifying the

core principles of common ancestry for extant descendants.

Introduction

Nested hierarchy of species, resulting from the descent

with modification process,1 is fundamental to our under-

standing of the evolution of biological diversity and

life in general. In molecular genealogy, the sequential

accumulation of mutations since the time of the most

recent common ancestor (MRCA) is reflected within the

ever-evolving phylogeny of any genetic locus. Accordingly,

the reconstructed ancestral sequence of a locus should

optimally serve as the reference point for its derived

alleles.2 The human mtDNA phylogeny3–7 is an almost

perfect molecular prototype for a nonrecombining locus,

and knowledge on its variation has been and is extensively

used in medical, genealogical, forensic, and popula-

tion genetic studies.8–11 Boosted by rapid advances in

sequencing and genotyping technology, its mode of inher-

itance, high mutation rate, lack of recombination, and

high cellular copy number have proved critical in making

this locus the primary choice in the field of archaeoge-

netics and ancient DNA.12–14 Although its early synthesis

was based on restriction-fragment-length polymor-

phisms,15–18 control-region variation,19,20 or a combina-

tion of both,21 the human mtDNA phylogeny is now

reconstructed from complete mtDNA sequences,4,6,7,22

thus stretching the phylogenetic resolution to its maxi-

mum. mtDNA also became the main target of ancient-

DNA studies because it is much more abundant than

nuclear DNA.13 The recently published Homo neandertha-

lensis mitogenomes23,24 represent the best available out-

group source for rooting the human mtDNA phylogeny

known to lie within the contemporary African

variation.22,25,26 Despite these major advances, the extinct

human mtDNA complete root sequence was never

precisely determined, and mtDNA nomenclature remains

cumbersome because it refers to the first completely

sequenced mtDNA,27,28 labeled rCRS, which is now

known to belong to the recently coalescing European

haplogroup H2a2a1.7 The use of the rCRS as a reference

resulted in a number of practical problems such as (1)

the misidentification of derived versus ancestral states

of alleles and (2) the count of nonsynonymous muta-

tions that map to the path between the rCRS and

the case sequences.29 For instance, clinical and func-

tional studies frequently include among the putative

nonsynonymous candidate mutations the haplogroup-

HV-defining transition at position 14766 (CYTB) simply

because the revised Cambridge Reference Sequence

(rCRS) belongs to its derived haplogroup H.30
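
The 14766 example can be made concrete with a small sketch. The snippet below is illustrative only: it uses position 14766 (labeled T14766C on the HV branch in Figure 1) together with two positions labeled on the H branch (G2706A and T7028C), represents each reference as a hypothetical position-to-base dictionary rather than a full sequence, and the helper name `call_variants` is an assumption. It shows how a sample that carries only ancestral alleles is reported with three apparent substitutions when scored against an rCRS-like reference and with none when scored against an RSRS-like root.

```python
def call_variants(reference, sample):
    """List substitutions as 'RefPosAlt' strings where sample differs from reference."""
    return [
        f"{reference[pos]}{pos}{sample[pos]}"
        for pos in sorted(reference)
        if sample[pos] != reference[pos]
    ]


# Toy references restricted to three positions (assumed alleles, see lead-in).
rsrs_like = {2706: "G", 7028: "T", 14766: "T"}   # ancestral states
rcrs_like = {2706: "A", 7028: "C", 14766: "C"}   # derived states within H

# Hypothetical samples: one outside haplogroup HV, one inside haplogroup H.
sample_outside_hv = {2706: "G", 7028: "T", 14766: "T"}
sample_in_h = {2706: "A", 7028: "C", 14766: "C"}

print("vs rCRS-like:", call_variants(rcrs_like, sample_outside_hv))
print("vs RSRS-like:", call_variants(rsrs_like, sample_outside_hv))
print("H sample vs RSRS-like:", call_variants(rsrs_like, sample_in_h))
```

Scored against the rCRS-like reference, the ancestral sample appears to carry changes at all three positions even though the changes actually arose on the lineage leading to the reference; scored against the root, the same changes are attributed to the haplogroup H sample in the ancestral-to-derived direction.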

In this study, to definitively address these issues,

we propose a ‘‘Copernican’’ reassessment of the human

mtDNA phylogeny by switching to a Reconstructed

Sapiens Reference Sequence (RSRS) as the phylogenetically

valid reference point. To this end, the previously suggested

root7,22,25 was updated to most parsimoniously incorporate

the available mitogenomes from H. neanderthalensis.23,24

Moreover, we further refined the human mtDNA

phylogeny to an unprecedented level by adding informa-

tion from 8,216 mitogenomes and evaluated the ranges

of nucleotide substitutions from the root RSRS rather

than the rCRS28 as a reference point (Figure 1 and Figure S1,

available online).
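
The role of the Neanderthal outgroup in fixing the root can be illustrated with a deliberately simplified sketch. Assuming a single site at which the two deepest human clades differ, the allele shared with the outgroup is taken as ancestral under parsimony. The function name and the example alleles are hypothetical, and a real reconstruction would weigh all available mitogenomes rather than one site and one outgroup sequence at a time.

```python
def infer_root_allele(clade_a, clade_b, outgroup):
    """Parsimony call for one site at the root of two sister clades.

    Returns the inferred ancestral base, or None when the outgroup
    matches neither clade and the site stays unresolved.
    """
    if clade_a == clade_b:
        return clade_a          # no change needed on either branch
    if outgroup in (clade_a, clade_b):
        return outgroup         # one change on the branch that differs
    return None                 # outgroup is uninformative at this site


# Hypothetical alleles at three sites: (clade A, clade B, outgroup).
sites = {"site_1": ("G", "A", "G"), "site_2": ("C", "C", "T"), "site_3": ("A", "G", "T")}
for name, (a, b, out) in sites.items():
    print(name, infer_root_allele(a, b, out))
```

Sites such as the third one, where the outgroup carries a different base from both clades, are left unresolved by this rule and would need additional outgroup sequences or other evidence.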

1Estonian Biocentre and Department of Evolutionary Biology, University of Tartu, Tartu 51010, Estonia; 2Molecular Medicine Laboratory, Rambam Health

Care Campus, Haifa 31096, Israel; 3Department of Forensic Molecular Biology, Erasmus MC, University Medical Center Rotterdam, 3000 CA Rotterdam,

The Netherlands; 4Department of Statistics and Operations Research, School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel; 5Instituto

de Patologia e Imunologia Molecular da Universidade do Porto, Porto 4200-465, Portugal; 6Department of Biological Anthropology, University of

Cambridge, Cambridge CB2 1QH, UK; 7Dipartimento di Biologia e Biotecnologie ‘‘L. Spallanzani,’’ Universita di Pavia, Pavia 27100, Italy; 8Estonian

Academy of Sciences, 6 Kohtu Street, Tallinn 10130, Estonia

*Correspondence: [email protected] (D.M.B.), [email protected] (M.v.O.)

DOI 10.1016/j.ajhg.2012.03.002. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 675–684, April 6, 2012 675

Figure 1. Schematic Representation of the Human mtDNA Phylogeny within Hominini
(Left) Hominini phylogeny illustrating approximate divergence times of the studied species. The positions of the RSRS and the putative Reconstructed Neanderthal Reference Sequence (RNRS) are shown.
(Right) Magnification of the human mtDNA phylogeny. Mutated nucleotide positions separating the nodes of the two basal human haplogroups L0 and L1’2’3’4’5’6 and their derived states as compared to the RSRS are shown. The positions of the rCRS and the RSRS are indicated by golden and green five-pointed stars, respectively. Accordingly, the number of mutations counted from the rCRS (NC_012920) or the RSRS (Sequence S1) to the L0d1c1b (EU092832) and H4a1a (HQ860291) haplotypes retrieved from a San and a German, respectively, are marked on the golden and green branches. The principle of equidistant star-like radiation from the common ancestor of all contemporary haplotypes is highlighted when the RSRS is preferred over the rCRS as the reference sequence.


Subjects and Methods

Updating the Human mtDNA Phylogeny and Inference of the Ancestral Root Haplotype

MtDNA Genomes Comprising the Phylogeny

A total of 18,843 complete mtDNA sequences were used to refine

the human mtDNA phylogeny of which 10,627 were previously

reported and used for the mtDNA tree Build 13 (28 Dec 2011)

as posted by PhyloTree.7 The remaining 8,216 sequences are

mainly from the large complete mtDNA database available at

FamilyTreeDNA and in part from data sets maintained by the

authors. The large database available at FamilyTreeDNA was

privately obtained by the sample donors, usually for genealogical

purposes. Most donors were of western Eurasian ancestry, but

donors with matrilineal ancestry from other geographical regions

have also contributed. Once the mtDNA sequences were obtained,

donors had several options: keep them confidential, share them

with peer genealogists, submit them to the National Center for

Biotechnology Information (NCBI) GenBank, and/or consent to

contribute them anonymously to a research database maintained

by FamilyTreeDNA to improve the mtDNA phylogeny. In turn,

this contribution rewards and enriches the genealogical experi-

ence as well as benefits the scientific community. All the proce-

dures followed in this study were in accordance with the ethical

standards of the responsible committee on human experimenta-

tion of the participating research centers.

Likewise, it is important to clarify that because the complete

sequences were obtained privately, some donors have indepen-

dently uploaded their sequence to NCBI. Currently (as of February

28, 2012), a total of 1,220 complete mtDNA sequences that were

generated at FamilyTreeDNA were privately deposited in NCBI

GenBank. Most of these sequences were already considered in

the previous PhyloTree Builds.7 Because we have no way to

know which of the sequences were autonomously uploaded to

NCBI, all duplicate sequences that matched precisely between

NCBI and our database were excluded from our analysis. There-

fore, even if multiple samples were excluded, no topological infor-

mation was lost. Accordingly, out of the 8,216 sequences used

to verify the phylogeny, a total of 4,265 sequences are released

and deposited in NCBI GenBank under accession numbers

JQ701803–JQ706067. The complete mtDNA sequences of the

Neanderthals were retrieved from the literature.23,24

Complete mtDNA Sequencing

DNA was extracted from buccal swabs. MtDNA was amplified with

18 primers to yield nine overlapping fragments as previously

reported.22 PCR products were cleaned with magnetic-particle

technology (BioSprint 96; QIAGEN). After purification, the nine

fragments were sequenced by means of 92 internal primers to

obtain the complete mtDNA genome. Sequencing was performed

on a 3730xl DNA Analyzer (Applied Biosystems), and the resulting

sequences were analyzed with the Sequencher software (Gene

Codes Corporation). Mutations were scored relative to the rCRS

and the suggested RSRS. Sample quality control was assured as

follows:

(1) After the PCR amplification of the nine fragments, DNA

handling and distribution to the 96 sequencing reactions

was aided by the Beckman Coulter Biomek FX liquid

handler to minimize the chance for human pipetting

errors.

(2) All 96 sequencing reactions of each sample were performed

simultaneously in the same sequencing run. Most observed

mutations were determined by at least two sequence reads.

However, in a minority of the cases only one sequence read

was available because of various technical reasons, usually

related to the amount and quality of the DNA available.

(3) Any fragment that failed the first sequencing attempt or

any ambiguous base call was tested by additional and

independent PCR and sequencing reactions. In these cases,

the first hypervariable segment (HVS-I) of the control

region was resequenced too to assure that the correct

sample was retrieved.

(4) Genotyping history for each sample was recorded to help

in the search for DNA handling errors and artificial recom-

bination events.

(5) All sequences were aligned with the software Sequencher

(Gene Codes Corporation), and all positions with a Phred

score less than 30 were manually evaluated by an operator.

Two independent operators read each sequence. All posi-

tions that differed from the reference sequences were

recorded electronically to minimize typographic errors.

(6) Any sequence that did not comfortably fit within the estab-

lished human mtDNA phylogeny was highlighted and

resequenced to exclude potential lab errors.

(7) Any comments and remarks raised by external investiga-

tors after release of the data will be addressed by reassessing

the original sequences for accuracy. After that, any unre-

solved result will be further examined by resequencing

and, if necessary, immediately corrected.

Tree Reconstruction and Notation of Mutations

The phylogeny was reconstructed by evaluating both the previously

published and the herein released complete

mtDNA sequences, aiming at the most parsimonious solution

and aided by the software mtPhyl. Polymorphic positions are

shown on the branches and reticulations were resolved by consid-

ering the degree of mutability of individual positions as counted

by their number of occurrences in the overall phylogeny. Both

the ancestral and derived base status for each mutation appearing

in the phylogeny according to the International Union of Pure

and Applied Chemistry (IUPAC) nucleotide code are reported.

We use capital letters for transitions (e.g., G73A) and lowercase

letters for transversions (e.g., A73t). Although heteroplasmies are

not noted in the phylogeny, we recommend labeling them by

using IUPAC code and capital letters (e.g., G73R). Throughout

the phylogeny indels are given with respect to the RSRS and maintain

the traditional nucleotide position numbering as in the rCRS.

Sequencing alignment prefers 3′ placement for indels, except in

cases where the phylogeny suggests otherwise.31 Deletions are

indicated by a ‘‘d’’ after the deleted nucleotide position (e.g.,

T15944d). Insertions are indicated by a dot followed by the posi-

tion number and type of inserted nucleotide(s) (e.g., 5899.1C for

a C insertion at the first inserted nucleotide position after position

5899 and 5899.2C for a subsequent C insertion, and these are

abbreviated as 5899.1CC when occurring on the same branch).

We label polynucleotide stretches of unknown length as follows:

573.XC. In cases where an insertion occurred at an ancestral

branch but a reversion of this insertion (= deletion) took place

at a descendant branch, we noted the latter as follows:

5899.1Cd. An exclamation mark (!) at the end of a labeled position

denotes a reversion to the ancestral state. The number of exclama-

tion marks stands for the number of sequential reversions in

the given position from the RSRS (e.g., C152T, T152C!, and


C152T!!). Some indel positions have been a source of confusion

because multiple alignment solutions enable alternative scoring.

Notably, the dinucleotide repeat in hypervariable segment II

(HVS-II) of the control region can be viewed either as a CA repeat

starting at position 514 or as an AC repeat starting at position 515,

leading to two different notations being in use for a repeat loss:

522–523d versus 523–524d. We adhered to the guidelines for

consistent treatment of mtDNA-length variants that were estab-

lished by the forensic genetic community31 and favor the AC

interpretation. As the RSRS has one AC unit less compared to

the rCRS, we filled positions 523 and 524 of the RSRS with "NN,"

thereby preserving the historical genome annotation numbering.

Consequently, an AC insertion compared to the RSRS is scored as

522.1AC, whereas an AC deletion is scored as 521–522d. Table S2

presents all common indel positions throughout the complete

mtDNA sequence and the way we labeled them. Transitions at

the hypervariable position 16519, insertions of one or two Cs at

positions 309, 315, and 16193, A to C transversions at 16182

and 16183, as well as length variation of the AC dinucleotide

repeat spanning 515–522, were excluded from the phylogeny.
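As an illustration of the notation rules above, the following minimal Python sketch parses individual mutation labels into their components. It is not part of the tools released with this study; the function name `parse_label`, the dictionary keys, and the regular expressions are assumptions made for this example. Range deletions such as 522–523d are not handled, and the transition/transversion call relies only on the letter case of the label, as the notation prescribes.

```python
import re

# IUPAC ambiguity codes (everything other than A, C, G, T), used for heteroplasmies.
AMBIGUITY_CODES = set("RYSWKMBDHVN")

# Hypothetical patterns for the three label families described in the text.
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGTacgtRYSWKMBDHVN])(!*)$")
DELETION = re.compile(r"^([ACGT]?)(\d+)d(!*)$")
INSERTION = re.compile(r"^(\d+)\.(\d+|X)([ACGT]+)(d?)(!*)$")


def parse_label(label):
    """Split one mutation label into a small dictionary of its parts."""
    m = DELETION.match(label)
    if m:
        return {"kind": "deletion", "position": int(m.group(2)),
                "reversions": len(m.group(3))}
    m = INSERTION.match(label)
    if m:
        # A trailing "d" marks the later loss of an ancestral insertion.
        kind = "insertion_reverted" if m.group(4) else "insertion"
        return {"kind": kind, "position": int(m.group(1)),
                "inserted": m.group(3), "reversions": len(m.group(5))}
    m = SUBSTITUTION.match(label)
    if m is None:
        raise ValueError(f"unrecognized mutation label: {label!r}")
    ancestral, position, derived, bangs = m.groups()
    if derived in AMBIGUITY_CODES:
        kind = "heteroplasmy"      # e.g., G73R
    elif derived.isupper():
        kind = "transition"        # capital derived base, e.g., G73A
    else:
        kind = "transversion"      # lowercase derived base, e.g., A73t
    return {"kind": kind, "position": int(position), "ancestral": ancestral,
            "derived": derived.upper(), "reversions": len(bangs)}


if __name__ == "__main__":
    for example in ["G73A", "A73t", "C152T!!", "T15944d", "5899.1CC", "573.XC"]:
        print(example, parse_label(example))
```

The number of exclamation marks is returned as `reversions`, so C152T, T152C!, and C152T!! can be told apart by that field alone.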

Haplogroup labels were re-evaluated and the following sugges-

tions were made:

(1) Monophyletic clades that are composed of two or more

previously named haplogroups are labeled by concate-

nating their names and separating them by apostrophe

(e.g., L0a’b). This is not applied in the case of capital-

letter-only labeled haplogroups (e.g., JT);

(2) We suggest labeling an extant sample that matches

a haplogroup root with the superscript case letter n for

‘‘nodal’’ (e.g., Hn);

(3) We note that when complete mtDNA sequences are considered,

the inability to differentiate a nodal haplotype from

an unresolved paraphyletic clade is eliminated. Accordingly,

the haplogroup label of each observed complete

mtDNA sequence can: (1) mark it in a nodal position; (2)

affiliate it with a previously labeled haplogroup; (3) suggest

a, so far, unlabeled haplogroup; or (4) in the absence of

two additional samples to justify the labeling of a, so far,

unidentified haplogroup, affiliate it with the ancestral

haplogroup. So, the label of a given sample as ‘‘H’’ means

that it is an unlabeled descendent of haplogroup H that

cannot be affiliated to any known H haplogroup clade

at the time of report and based on complete mtDNA

sequence. We suggest restricting the use of label ‘‘H*’’ to

cases where the haplogroup labeling is based on partial

mtDNA sequence;

(4) To aid the nonexpert in understanding the mtDNA hap-

logroup nomenclature system, we summarize in Table S3

the cases where haplogroup labels do not logically follow

from the hierarchy and hence could lead to confusion.

Changing these haplogroup labels to make them more

logical is undesirable at this stage because they are already

used extensively in the literature and therefore changing

them would probably cause even more confusion. In addi-

tion, we note that for the most basal nodes of the

phylogeny, historically the following shorthand names

have been in use: L1’5 = L1’2’3’4’5’6; L2’5 = L2’3’4’5’6; L2’6 = L2’3’4’6; and L4’6 = L3’4’6, which we will herein

refer to by their full name. One shorthand haplogroup

name, M4’’67, is maintained because writing it in full

(M4’18’30’37’38’43’45’63’64’65’66’67) seems impractical.

It is important to note that the aim of this study is to publish the

most up-to-date human mtDNA phylogeny, and it cannot be

regarded by any means as a population-level survey exploring

the frequencies and distributions of the various haplogroups.

Therefore, although all sequences were used to establish the tree

topology, the subset of sequences actually presented in the

phylogeny is lower because for each branch up to two representa-

tive example sequences are provided. In most cases, we labeled

haplogroups only when supported by at least three distinct haplo-

types to maximize the accuracy of the haplogroup defining array

of mutations and to avoid the establishment of haplogroups

resulting from sequencing errors. Exceptions included previously

established haplogroups or haplogroups supported by a particu-

larly long array of mutations. Accordingly, the tips of the herein

released phylogeny are in fact internal haplogroup nodes, thus

private mutations (if any) of individual haplotypes were not

included.

Evaluation of the mtDNA Clock and Age Estimates

Substitution Counts and Molecular Clock

To calculate the substitution counts from the RSRS to every extant

mitogenome (which is a tip in the mtDNA phylogeny), we

summed up the number of mutations on the path leading to

each noted haplogroup in the phylogeny and added to this the

number of positions that differed between the tip and the root

of the haplogroup. Thus, we are guaranteed to correctly count

all parallel and back mutations, except for the case where two

mutations affecting the same position occurred on a branch in

the tree (in which case we either count zero instead of two, if

the second is a back mutation, or one instead of two, if the second

mutation is not back to the initial state). As has been argued in the

past, such repeated mutations within a single branch in the highly

resolved human mtDNA tree are highly unlikely,32 and are even

more so if the fastest mutating sites (16519 and the A to C trans-

versions and poly-C insertions around the HVS-I position 16189)

are eliminated, as was done in our analysis.
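The counting rule described above can be sketched in Python as follows. The tree encoding (a dictionary mapping each node to its parent and to the mutation labels on the branch leading to it), the function name, and the toy node and mutation labels are all hypothetical and serve only to show the arithmetic; they are not taken from the phylogeny.

```python
def substitutions_from_root(tree, haplogroup, private_mutations=()):
    """Count substitutions from the root to one mitogenome.

    tree: dict mapping node name -> (parent name or None, list of branch mutations)
    haplogroup: the labeled node to which the mitogenome is assigned
    private_mutations: positions at which the tip differs from its haplogroup root
    """
    count = len(private_mutations)
    node = haplogroup
    while node is not None:
        parent, branch_mutations = tree[node]
        count += len(branch_mutations)   # every branch on the path contributes
        node = parent
    return count


# Toy tree with invented labels (assumption, for illustration only).
toy_tree = {
    "ROOT": (None, []),
    "A": ("ROOT", ["G100A", "C200T"]),
    "A1": ("A", ["T300C"]),
    "B": ("ROOT", ["A400G"]),
}

if __name__ == "__main__":
    # 2 + 1 branch mutations plus 1 private mutation = 4
    print(substitutions_from_root(toy_tree, "A1", ["G500A"]))
```

Because each branch is counted separately, parallel and back mutations on different branches are both included, which matches the text; two hits at the same position within one branch would still be missed, as noted above.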

To test the validity of molecular clock assumption on human

mtDNA substitutions, we used PAML 4.4 with the HKY85 substitu-

tion model to generate maximum likelihood estimates of branch

lengths with and without the molecular clock assumption. We

chose to sample around 200–300 sequences and analyze their

coalescent tree (a subtree of the complete tree) in each PAML

run, to accommodate PAML’s computational limitations, and

also to sample mostly deep branches (such as M44), rather than

the recent and very short branches (such as D4a1b1) of the over-

sampled haplogroups such as H and D. Thus, we preferentially

sampled haplogroups whose coalescence with other samples in

the tree was more ancient. This ensured that even in such

a sample, the deeper clades such as the basal M clades would

be represented with high probability, whereas more recently

coalescing haplogroups such as the ones of haplogroup D would

be rarely sampled.

The generalized likelihood ratio (GLR) test for validity of the

clock assumption then uses the test statistic 2 × (log-likelihood

of non-clock model − log-likelihood of clock model), which,

under the null hypothesis of molecular clock, has a χ² distribution

with degrees of freedom equal to the number of parameters under

no clock (= number of branches in the tree) minus number of

parameters under clock (= number of internal nodes in the tree).
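A minimal sketch of this test in Python is given below, assuming the two log-likelihoods have already been obtained (for example, from separate clock and non-clock PAML runs). The function names and the numerical inputs in the usage lines are placeholders invented for this example, not values from the study. The χ² tail probability is computed with the standard library only, using the recurrence Q(a + 1, x) = Q(a, x) + x^a e^(−x) / Γ(a + 1) for the regularized upper incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2:                      # odd df: start from Q(1/2, h) = erfc(sqrt(h))
        a, q = 0.5, math.erfc(math.sqrt(h))
    else:                           # even df: start from Q(1, h) = exp(-h)
        a, q = 1.0, math.exp(-h)
    while a < df / 2.0:
        # add h^a * exp(-h) / Gamma(a + 1), evaluated in log space
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def glr_clock_test(lnl_no_clock, lnl_clock, n_branches, n_internal_nodes):
    """Return (test statistic, degrees of freedom, p value) for the clock test."""
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    df = n_branches - n_internal_nodes
    return statistic, df, chi2_sf(statistic, df)


if __name__ == "__main__":
    # Placeholder log-likelihoods and tree sizes (assumptions, not study data).
    stat, df, p = glr_clock_test(-1000.0, -1010.0, n_branches=10, n_internal_nodes=6)
    print(stat, df, p)
```

A small p value indicates that the non-clock model fits significantly better, i.e., that the molecular clock assumption is rejected for the sampled subtree.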

We performed the analyses on two sets of the mtDNA

sequences: once by using the coding region alone and once on

the entire molecule. This was done as another sanity check for


the validity and generality of our results. All obtained p values are

presented in Table S4.

Age Calculations Assuming a Molecular Clock

In spite of the discovered clock violations, we were still interested in

applying the best available tools for estimating the ages of ancestral

nodes in the tree assuming a molecular clock. We adopted the

calculation approach and mutation rate estimate of a previous study,32 which suggests

estimating ages in substitutions and then transforming them to years

in a nonlinear manner accounting for the selection effect on non-

synonymous mutations. We used PAML 4.433 with the HKY85

substitution model to generate maximum likelihood estimates of

internal node ages under a molecular clock assumption. Because

PAML is computationally limited in the size of trees it can analyze,

we performed estimation for the whole tree in several separate runs.

We divided the tree into seven collections of haplogroups:

• All L haplogroups (i.e., the entire phylogeny excluding M and N)

• All of M excluding D

• D and JT

• H excluding H1 and H5

• B4’5 and HV excluding H but including H1 and H5

• U

• N excluding HV, U, JT and B4’5

For each PAML run, we selected all sequences belonging to one

of these sets, and added a small random sample of other samples

from the rest of the phylogeny to maintain ‘‘calibration.’’ Putting

together the estimates from all seven runs provided us with age

estimates for all nodes in our tree. Estimates are given in Table S5.

Data Transition

We are aware that the suggested change can raise difficulties and

even antagonism from the scientific community. On the other

hand, a scenario in which a reference sequence of a genetic locus

does not represent its ancestral sequence should, indisputably, be

corrected. The realization of the superiority of complete mtDNA

sequence analysis compared to other approaches, combined

with the emergence of deep sequencing technologies, will possibly

shift the entire field into the use of only complete mtDNA

sequences in the near future.34–36 Therefore, the sooner the

change is made the less ‘‘painful’’ it will be. As the common

practice for reporting complete mtDNA sequences is by posting

the sequences as FASTA files to NCBI, rather than reporting the

substitutions with respect to a reference sequence (as in the case

of many data sets restricted to control-region variation), no major

change is needed. When a FASTA file is available or created, the

only change needed is to switch the reference sequence to the

RSRS. For control-region-based data sets, the conversion might

be more problematic as the common practice to report the

sequences in literature did not involve FASTA files but recorded

mutations as compared to the rCRS. Table S6 compares the classic

diagnostic mutations for the major haplogroups relative to the

rCRS or the RSRS.
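To illustrate why switching the reference is straightforward once a full sequence is available, the following Python sketch lists the substitutions of one sample against two different references. It assumes that the sample and the reference are already aligned in the same coordinate system and have equal length, and it ignores indels; the function name and the eight-base toy sequences are invented for this example and are not real mtDNA data. Positions holding "N" (as in the padded positions 523 and 524 of the RSRS) are skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def variants_against(reference, sample):
    """List substitutions of `sample` relative to `reference` (1-based positions).

    Transitions are written with a capital derived base and transversions
    with a lowercase one, following the notation used in this study.
    Only the bases A, C, G, T and the padding symbol N are expected.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    labels = []
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, alt_base) in enumerate(pairs, start=1):
        if ref_base == alt_base or "N" in (ref_base, alt_base):
            continue
        both = {ref_base, alt_base}
        is_transition = both <= PURINES or both <= PYRIMIDINES
        derived = alt_base if is_transition else alt_base.lower()
        labels.append(f"{ref_base}{position}{derived}")
    return labels


if __name__ == "__main__":
    # Toy sequences (assumptions): a "derived" reference and an "ancestral" one.
    derived_reference = "ACGTACGT"
    ancestral_reference = "ACATACGT"
    sample = "ACATACTT"
    print(variants_against(derived_reference, sample))    # ['G3A', 'G7t']
    print(variants_against(ancestral_reference, sample))  # ['G7t']
```

The same sample yields a different list of differences depending on the reference, which is the effect summarized for the major haplogroups in Table S6.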

To facilitate data transition we release the tools ‘‘FASTmtDNA,’’

which allows transformation of Excel list-type reports of mtDNA

haplotypes into FASTA files, and ‘‘mtDNAble,’’ which labels

haplogroups, performs a phylogeny-based quality check and

identifies private substitutions. These noted features are fully

supported in a web interface or as standalone versions, which

can be freely downloaded from the website including their manual

and example files. In addition, the web interface allows the

benefit of comparing private substitutions between submitted

and previously stored mitogenomes to suggest the labeling of

additional haplogroups. Following quality check and consent, the

web interface enables the storing of complete mtDNA sequences

by members of the mtDNA community to enrich a growing

database. This in turn is expected to strengthen the data set used

by the website to label haplogroups, perform quality control and

refine the phylogeny. Additional tools will be periodically added

and updated.

Results

The RSRS

Since the sub-Saharan haplogroup L0 was defined,37 it

became clear that the root of the extant variation

of human mitochondrial genomes is allocated between

haplogroups L0 and L1’2’3’4’5’6, which are separated

from each other by 14 coding and four control-region

mutations22 (Figure 1). Until now, our understanding of

the root of the human mtDNA tree was incomplete

because of the absence of reliable closely related outgroup

mitogenomes, and the exact placement of the 18 mutations

separating the L0 and L1’2’3’4’5’6 nodes remained

vague. In principle, ancient mtDNA from early human

fossils might be informative but is unreachable because of

considerable technical problems inherent to the analysis

process.13 However, as the split between H. sapiens and

H. neanderthalensis certainly predates the appearance of

the RSRS,38 a resolution of the deepest node might

be achieved by rooting the human phylogeny with

H. neanderthalensis complete mtDNA sequences23,24

(Figure 1). Table S1 shows all substitutions separating haplogroup

L0 from L1’2’3’4’5’6, their status in the six

H. neanderthalensis mitogenomes and their most parsimo-

nious allocation around the human root. Accordingly,

the ancestral mtDNA sequence of extant humans should

correspond to the bifurcation of L0 and L1’2’3’4’5’6.

Although it cannot be excluded that further sampling of

the African mtDNA variation might reveal yet another

more basal clade of the human mtDNA tree, it is at least

equally valid to indicate that, in spite of the many

thousands of reported complete mtDNA sequences,7 such

a clade has not been found so far. Operating under this

assumption we established the reference point, RSRS,

which is made available as Sequence S1.

We present the most resolved human mtDNA

phylogeny by compiling the information from 18,843

mitochondrial genomes of which 10,627 were previously

summarized in PhyloTree Build 13 (28 Dec 2011).7 We fol-

lowed the established cladistic notation for haplogroup

labeling adjusted for complete mtDNA genomes.7,39 Yet,

in contrast with the previously reported phylogeny, all

mutational changes noted on the branches of the tree indi-

cate the actual descendant nucleotide state relative to the

state in the RSRS. Although this has no effect on the tree

topology per se, it is critical to emphasize its major conse-

quences in the way of reporting the list of mutations


denoting an mtDNA haplotype. Accordingly, although the

HVS-I haplotype of a nodal haplogroup H2a2a1 mitoge-

nome will show no differences when compared to the

rCRS, its differentiation relative to the RSRS is now docu-

mented by the transitions A16129G, T16187C, C16189T,

T16223C, G16230A, T16278C and C16311T. The

common practice of expressing haplotypes as a string of

differences from the rCRS (Figure 1) led, for instance,

many inexperienced readers to incorrectly hold the ‘‘fact’’

that African haplogroup L mitogenomes have more substi-

tutions separating them from the rCRS as compared to

western Eurasian haplogroup H mitogenomes as a ‘‘proof’’

of an African origin for all contemporary humans.

Indications for Violation of the Molecular Clock

The accepted notion of a molecular clock means that

contemporary mtDNA haplotypes should show statisti-

cally insignificant differences in the number of accu-

mulated mutations from the RSRS.40 Triggered by the

suggested change in the reference sequence that facili-

tates substitution counts from the ancestral root, we

further evaluated this hypothesis. The range of sub-

stitution counts separating contemporary mitogenomes

belonging to major haplogroups from the RSRS is shown

in Figure S2. The mean distance is 57.1 substitutions, the

median is 56 and the empirical standard deviation is 5.9.

Widely different distances ranging from 41 substitutions in

some L0d1a1 mitogenomes to 77 in some L2b1a mitoge-

nomes are observed. Interestingly, the ranges of sub-

stitution counts within haplogroups M and N, which are

hallmarks of the relatively recent out-of-Africa exodus of

humans, are also very large. For example, within M there

are two mitogenomes with 43 substitutions (in M30a and

M44) and two mitogenomes with as many as 71 substitu-

tions (in M2b1b and M7b3a). This is especially striking

because the path from the RSRS to the root of M already

contains 39 substitutions. Hence, the difference between

the M root and its M44 descendant is only four substitu-

tions (two in the coding region and two in the control

region) as compared to 32 substitutions in the M2b1b

and M7b3a mitogenomes. These observations raise the

possibility that the tree in general, and haplogroup M in

particular, might not adhere uniformly to the assumed

molecular clock, under which substitutions occur at a fixed

rate on all branches of the tree over time. We evaluated this

scenario by performing generalized likelihood ratio tests of

the molecular clock by using PAML33 on subsets of samples

from the entire tree, on haplogroup L2 (following past

evidence of clock violations in this haplogroup40) and on

the sister haplogroups M and N. Our results demonstrate

violations of the molecular clock in M (0.00015 ≤

p value ≤ 0.0003 for χ² GLR test in three different analyses)

and give mixed results for the entire tree (p = 0.005

and p = 0.018 for two analyses, which might be sensitive

to the parts of the tree randomly sampled) and L2 (GLR

χ² p value = 5 × 10⁻⁵ and p value = 0.033 for two analyses)

and borderline results in N (GLR χ² p value = 0.049 and

p value = 0.054 in two analyses). We are currently unable

to offer well-founded explanations for these findings,

which remain the scope of future studies.

As the clock violation was observed only in a restricted

number of specified cases, we applied the best available

tools for estimating the ages of ancestral nodes. We adop-

ted a conventional calculation approach and mutation

rate32 and used PAML 4.4 to generate maximum likelihood

estimates for internal node ages under a molecular clock

assumption.33 Figure 2 displays the phylogeny and density

of extant haplogroups as a function of both the number of

substitutions occurring since the RSRS and the estimated

coalescence times.

Approaching a Perfect Phylogeny

The mitochondrial genomes released herein almost double

the number of sequences that were previously available.

Despite the fact that the sequences released in this study

are not equally representative of all human populations

but are mainly from donors of western Eurasian matrilineal

ancestry, a few additional advantages arise from this com-

bined data. First, an almost final level of resolution for

a number of western Eurasian clades was achieved, and

the nodes of ancestral and derived haplogroups are often

differentiated by a single mutation. For example, Figure 3

Figure 2. Human mtDNA Phylogeny
A schematic representation of the most parsimonious human mtDNA phylogeny inferred from 18,843 complete mtDNA sequences with the structure shown explicitly for bifurcations that occurred 40,000 years before present (YBP) or earlier, and a graph showing the explosion of haplogroups since then. The y axis indicates the approximate number of haplogroups from each time layer that have survived to nowadays. The upper and lower x axes of the rooted tree are scaled according to the number of accumulated mutations since the RSRS and the corresponding coalescence ages, respectively.


compares the resolution of haplogroup H4 as first41 and as

currently resolved. This comprehensive level of resolution

minimizes the chance of additional nomenclature issues

arising in future studies. Second, the highly resolved phy-

logeny is a powerful tool for quality assessment.29,42–44

Mapping any additional complete mtDNA haplotype to

such highly resolved phylogeny will highlight potential

sequencing errors and problems such as sample mix-

up, contamination, and typographical errors. Third, the

phylogeny itself is a useful resource for future evolutionary,

clinical, and forensic studies.45–51

Discussion

Thirty-one years ago, Anderson and colleagues27 published

the first complete sequence of human mtDNA. This

became the reference sequence in multidisciplinary studies

that revolutionized human genetics, leading, for instance,

to the concept of ‘‘late-out-of-Africa’’ (‘‘African Eve’’)

peopling of the world by modern humans,17,18 the identi-

fication of a wide range of pathological mtDNA muta-

tions,52,53 and the possibility of reconstructing the origins

and the relationships of modern as well as ancient popula-

tions.12,14,54 The publication of globally selected complete

mtDNA genomes about 10 years ago marked the beginning

of the genomic era in this field.4 Since then, progress has

been impressive. Most admirable is the penetration of

the principles applied in the field of archaeogenetics to

hundreds of thousands of people around the world who

became interested in their matrilineal descent. In fact, in

this paper we add information from more than 8,000

complete mtDNA sequences resulting largely from the

curiosity and enthusiasm of lay people to the ~10,000

publicly available complete mtDNA sequences. However,

as discussed above, the entire field faces a problem: the

traditional manner of reporting variation observed in

human mitochondrial genome sequences is, to be blunt,

conceptually incorrect.

Supported by a consensus of many colleagues and after

a few years of hesitation, we have reached the conclusion

that on the verge of the deep-sequencing revolution,47,55

when perhaps tens of thousands of additional complete

mtDNA sequences are expected to be generated over the

next few years, the principal change we suggest cannot

be postponed any longer: an ancestral rather than a ‘‘phylo-

genetically peripheral’’ and modern mitogenome from

Europe should serve as the epicenter of the human mtDNA

reference system. Inevitably, the proposed change could

raise some temporary inconveniences. For this reason, we

provide tables and software to aid data transition.

What we propose is much more than a mere clerical

change. We use the Ptolemaic geocentric versus Copernican

heliocentric systems as a metaphor. And the metaphor

extends further: as the acceptance of the heliocentric

system circumvented epicycles in the orbits of planets,

Figure 3. Haplogroup H4 internal cladistic structure. (Left) Haplogroup H4 as first reported.41 Mutations in bold were considered diagnostic for the haplogroup. (Right) Haplogroup H4 as currently resolved with a total of 236 H4 mitogenomes. An almost perfect resolution of the nested hierarchy is achieved. Additional haplogroups suggested herein are shown in yellow. Control-region mutations are noted in blue.

The American Journal of Human Genetics 90, 675–684, April 6, 2012 681

switching the mtDNA reference to an ancestral RSRS will

end an academically inadmissible conjuncture where

virtually all mitochondrial genome sequences are scored

in part from derived-to-ancestral states and in part from

ancestral-to-derived states. We aim to trigger the radical

but necessary change in the way mtDNA mutations are

reported relative to their ancestral versus derived status,

thus establishing an intellectual cohesiveness with the

current consensus of shared common ancestry of all con-

temporary human mitochondrial genomes.
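The mixed scoring described above can be made concrete with a toy sketch. The three short sequences below are invented for illustration only; they are not the rCRS, the RSRS, or any real mitogenome, and the helper name `diffs` is hypothetical.

```python
def diffs(reference, sample):
    """List (position, reference_base, sample_base) mismatches, 1-based."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


# Invented toy sequences (assumptions, not real data).
ancestral_ref = "ACGTACGT"  # stands in for an ancestral reference
modern_ref = "ACGTATGT"     # a modern reference, derived at position 6 (C>T)
sample = "ACGAACGT"         # derived at position 4, ancestral at position 6

against_modern = diffs(modern_ref, sample)
against_ancestral = diffs(ancestral_ref, sample)
print(against_modern)     # includes position 6, scored derived-to-ancestral
print(against_ancestral)  # only the sample's own derived change
```

Against the modern reference, the sample appears to carry a change at position 6 even though it simply retains the ancestral state there; against the ancestral reference, every reported difference runs in the ancestral-to-derived direction.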

Note that the problem is not restricted to mtDNA.

Indeed, in the much larger perspective of complete nuclear

genomes in which comparisons are often currently made

relative to modern human reference sequences, often of

European origin, it seems worthwhile to begin consid-

ering, as valuable alternatives, public reference sequences

of ancestral alleles (common in all primates) whereby

derived alleles (common to some human populations)

would be distinguished.

Supplemental Data

Supplemental Data include two figures, six tables, and one

sequence and can be found with this article online at http://

www.cell.com/AJHG/.

Acknowledgments

We thank the genealogical community for donating their

privately obtained complete mtDNA sequences for scientific

studies and FamilyTreeDNA for compiling the data. We thank

FamilyTreeDNA for supporting the establishment of the herein

released website. We thank Eileen Krauss-Murphy of Family-

TreeDNA for help with assembly of the database. We thank

Rebekah Canada and William R. Hurst for help with the assembly

of haplogroup H and K samples, respectively. R.V. and D.M.B.

thank the European Commission, Directorate-General for

Research for FP7 Ecogene grant 205419. D.M.B. is a shareholder

of FamilyTreeDNA and a member of its scientific advisory board.

R.V. and M.M. thank the European Union, Regional Development

Fund for a Centre of Excellence in Genomics grant, and R.V.

thanks the Swedish Collegium for Advanced Studies for support

during the initial stage of this study. M.M. thanks Estonian Science

Foundation for grant 8973. A.T. received support from Fondazione

Alma Mater Ticinensis and the Italian Ministry of Education,

University and Research: Progetti Ricerca Interesse Nazionale

2009. S.R. thanks the Israeli Science Foundation for grant 1227/

09 and IBM for an Open Collaborative Research grant. FCT, the

Portuguese Foundation for Science and Technology, partially sup-

ported this work through the personal grant N.M.S. (SFRH/BD/

69119/2010). Instituto de Patologia e Imunologia Molecular da

Universidade do Porto is an Associate Laboratory of the Portuguese

Ministry of Science, Technology and Higher Education and is

partially supported by the Portuguese Foundation for Science

and Technology.

Received: January 9, 2012

Revised: February 22, 2012

Accepted: March 2, 2012

Published online: April 5, 2012

Web Resources

The URLs for data presented herein are as follows:

FASTmtDNA, http://www.mtdnacommunity.org

mtDNAble, http://www.mtdnacommunity.org

mtPhyl, http://eltsov.org/mtphyl.aspx

PhyloTree, http://www.phylotree.org

Accession Numbers

The 4,265 complete mtDNA sequences reported herein have been

submitted to GenBank (accession numbers JQ701803–JQ706067).

References

1. Darwin, C. (1859). Natural Selection. On the Origin of

Species by Means of Natural Selection, or, The Preservation

of Favoured Races in the Struggle for Life, Chapter 4 (London:

John Murray).

2. Delsuc, F., Brinkmann, H., and Philippe, H. (2005). Phyloge-

nomics and the reconstruction of the tree of life. Nat. Rev.

Genet. 6, 361–375.

3. Kivisild, T., Metspalu, E., Bandelt, H.J., Richards, M., and

Villems, R. (2006). The world mtDNA phylogeny. In Human

mitochondrial DNA and the evolution of Homo sapiens, H.J.

Bandelt, V. Macaulay, and M. Richards, eds. (Berlin: Springer-

Verlag), pp. 149–179.

4. Ingman, M., Kaessmann, H., Paabo, S., and Gyllensten, U.

(2000). Mitochondrial genome variation and the origin of

modern humans. Nature 408, 708–713.

5. Richards, M., and Macaulay, V. (2001). The mitochondrial

gene tree comes of age. Am. J. Hum. Genet. 68, 1315–1320.

6. Torroni, A., Achilli, A., Macaulay, V., Richards, M., and

Bandelt, H.J. (2006). Harvesting the fruit of the human

mtDNA tree. Trends Genet. 22, 339–345.

7. van Oven, M., and Kayser, M. (2009). Updated comprehensive

phylogenetic tree of global human mitochondrial DNA

variation. Hum. Mutat. 30, E386–E394.

8. Underhill, P.A., and Kivisild, T. (2007). Use of Y chromosome

and mitochondrial DNA population structure in tracing

human migrations. Annu. Rev. Genet. 41, 539–564.

9. Salas, A., Bandelt, H.J., Macaulay, V., and Richards, M.B.

(2007). Phylogeographic investigations: The role of trees in

forensic genetics. Forensic Sci. Int. 168, 1–13.

10. Shriver, M.D., and Kittles, R.A. (2004). Genetic ancestry and

the search for personalized genetic histories. Nat. Rev. Genet.

5, 611–618.

11. Taylor, R.W., and Turnbull, D.M. (2005). Mitochondrial DNA

mutations in human disease. Nat. Rev. Genet. 6, 389–402.

12. Gilbert, M.T., Kivisild, T., Grønnow, B., Andersen, P.K., Metspalu,

E., Reidla, M., Tamm, E., Axelsson, E., Gotherstrom, A., Campos,

P.F., et al. (2008). Paleo-Eskimo mtDNA genome reveals matri-

lineal discontinuity in Greenland. Science 320, 1787–1789.

13. Gilbert, M.T., Hansen, A.J., Willerslev, E., Rudbeck, L., Barnes,

I., Lynnerup, N., and Cooper, A. (2003). Characterization of

genetic miscoding lesions caused by postmortem damage.

Am. J. Hum. Genet. 72, 48–61.

14. Haak, W., Forster, P., Bramanti, B., Matsumura, S., Brandt, G.,

Tanzer, M., Villems, R., Renfrew, C., Gronenborn, D., Alt,

K.W., and Burger, J. (2005). Ancient DNA from the first Euro-

pean farmers in 7500-year-old Neolithic sites. Science 310,

1016–1018.


15. Denaro, M., Blanc, H., Johnson, M.J., Chen, K.H., Wilmsen,

E., Cavalli-Sforza, L.L., and Wallace, D.C. (1981). Ethnic vari-

ation in Hpa 1 endonuclease cleavage patterns of human

mitochondrial DNA. Proc. Natl. Acad. Sci. USA 78, 5768–5772.

16. Brown, W.M. (1980). Polymorphism in mitochondrial DNA of

humans as revealed by restriction endonuclease analysis. Proc.

Natl. Acad. Sci. USA 77, 3605–3609.

17. Cann, R.L., Stoneking, M., and Wilson, A.C. (1987). Mito-

chondrial DNA and human evolution. Nature 325, 31–36.

18. Vigilant, L., Stoneking, M., Harpending, H., Hawkes, K., and

Wilson, A.C. (1991). African populations and the evolution

of human mitochondrial DNA. Science 253, 1503–1507.

19. Richards, M., Corte-Real, H., Forster, P., Macaulay, V.,

Wilkinson-Herbots, H., Demaine, A., Papiha, S., Hedges, R.,

Bandelt, H.J., and Sykes, B. (1996). Paleolithic and neolithic

lineages in the European mitochondrial gene pool. Am. J.

Hum. Genet. 59, 185–203.

20. Torroni, A., Bandelt, H.J., D’Urbano, L., Lahermo, P., Moral, P.,

Sellitto, D., Rengo, C., Forster, P., Savontaus, M.L., Bonne-

Tamir, B., and Scozzari, R. (1998). mtDNA analysis reveals

a major late Paleolithic population expansion from south-

western to northeastern Europe. Am. J. Hum. Genet. 62,

1137–1152.

21. Torroni, A., Schurr, T.G., Cabell, M.F., Brown, M.D., Neel, J.V.,

Larsen, M., Smith, D.G., Vullo, C.M., and Wallace, D.C.

(1993). Asian affinities and continental radiation of the four

founding Native American mtDNAs. Am. J. Hum. Genet. 53,

563–590.

22. Behar, D.M., Villems, R., Soodyall, H., Blue-Smith, J., Pereira,

L., Metspalu, E., Scozzari, R., Makkan, H., Tzur, S., Comas,

D., et al; Genographic Consortium. (2008). The dawn of

human matrilineal diversity. Am. J. Hum. Genet. 82, 1130–

1140.

23. Briggs, A.W., Good, J.M., Green, R.E., Krause, J., Maricic, T.,

Stenzel, U., Lalueza-Fox, C., Rudan, P., Brajkovic, D., Kucan,

Z., et al. (2009). Targeted retrieval and analysis of five Nean-

dertal mtDNA genomes. Science 325, 318–321.

24. Green, R.E., Malaspinas, A.S., Krause, J., Briggs, A.W., Johnson,

P.L., Uhler, C., Meyer, M., Good, J.M., Maricic, T., Stenzel, U.,

et al. (2008). A complete Neandertal mitochondrial genome

sequence determined by high-throughput sequencing. Cell

134, 416–426.

25. Kivisild, T., Shen, P., Wall, D.P., Do, B., Sung, R., Davis, K.,

Passarino, G., Underhill, P.A., Scharfe, C., Torroni, A., et al.

(2006). The role of selection in the evolution of human mito-

chondrial genomes. Genetics 172, 373–387.

26. Kivisild, T., Reidla, M., Metspalu, E., Rosa, A., Brehm, A.,

Pennarun, E., Parik, J., Geberhiwot, T., Usanga, E., and

Villems, R. (2004). Ethiopian mitochondrial DNA heritage:

Tracking gene flow across and around the gate of tears. Am.

J. Hum. Genet. 75, 752–770.

27. Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H.,

Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe,

B.A., Sanger, F., et al. (1981). Sequence and organization of

the human mitochondrial genome. Nature 290, 457–465.

28. Andrews, R.M., Kubacka, I., Chinnery, P.F., Lightowlers, R.N.,

Turnbull, D.M., and Howell, N. (1999). Reanalysis and

revision of the Cambridge reference sequence for human

mitochondrial DNA. Nat. Genet. 23, 147.

29. Yao, Y.G., Salas, A., Bravi, C.M., and Bandelt, H.J. (2006).

A reappraisal of complete mtDNA variation in East Asian families

with hearing impairment. Hum. Genet. 119, 505–515.

30. Pello, R., Martín, M.A., Carelli, V., Nijtmans, L.G., Achilli, A.,

Pala, M., Torroni, A., Gomez-Duran, A., Ruiz-Pesini, E., Marti-

nuzzi, A., et al. (2008). Mitochondrial DNA background

modulates the assembly kinetics of OXPHOS complexes in

a cellular model of mitochondrial disease. Hum. Mol. Genet.

17, 4001–4011.

31. Bandelt, H.J., and Parson, W. (2008). Consistent treatment

of length variants in the human mtDNA control region:

A reappraisal. Int. J. Legal Med. 122, 11–21.

32. Soares, P., Ermini, L., Thomson, N., Mormina, M., Rito, T.,

Rohl, A., Salas, A., Oppenheimer, S., Macaulay, V., and Ri-

chards, M.B. (2009). Correcting for purifying selection: An

improved human mitochondrial molecular clock. Am. J.

Hum. Genet. 84, 740–759.

33. Yang, Z. (2007). PAML 4: Phylogenetic analysis by maximum

likelihood. Mol. Biol. Evol. 24, 1586–1591.

34. Tang, S., and Huang, T. (2010). Characterization of mitochon-

drial DNA heteroplasmy using a parallel sequencing system.

Biotechniques 48, 287–296.

35. Li, M., Schonberg, A., Schaefer, M., Schroeder, R., Nasidze, I.,

and Stoneking, M. (2010). Detecting heteroplasmy from

high-throughput sequencing of complete human mitochon-

drial DNA genomes. Am. J. Hum. Genet. 87, 237–249.

36. Zaragoza, M.V., Fass, J., Diegoli, M., Lin, D., and Arbustini, E.

(2010). Mitochondrial DNA variant discovery and evaluation

in human Cardiomyopathies through next-generation

sequencing. PLoS ONE 5, e12295.

37. Mishmar, D., Ruiz-Pesini, E., Golik, P., Macaulay, V., Clark,

A.G., Hosseini, S., Brandon, M., Easley, K., Chen, E., Brown,

M.D., et al. (2003). Natural selection shaped regional mtDNA

variation in humans. Proc. Natl. Acad. Sci. USA 100, 171–176.

38. Green, R.E., Krause, J., Briggs, A.W., Maricic, T., Stenzel, U.,

Kircher, M., Patterson, N., Li, H., Zhai, W., Fritz, M.H., et al.

(2010). A draft sequence of the Neandertal genome. Science

328, 710–722.

39. Richards, M.B., Macaulay, V.A., Bandelt, H.J., and Sykes, B.C.

(1998). Phylogeography of mitochondrial DNA in western

Europe. Ann. Hum. Genet. 62, 241–260.

40. Torroni, A., Rengo, C., Guida, V., Cruciani, F., Sellitto, D.,

Coppa, A., Calderon, F.L., Simionati, B., Valle, G., Richards,

M., et al. (2001). Do the four clades of the mtDNA haplogroup

L2 evolve at different rates? Am. J. Hum. Genet. 69, 1348–1356.

41. Achilli, A., Rengo, C., Magri, C., Battaglia, V., Olivieri, A., Scoz-

zari, R., Cruciani, F., Zeviani, M., Briem, E., Carelli, V., et al.

(2004). The molecular dissection of mtDNA haplogroup H

confirms that the Franco-Cantabrian glacial refuge was a major

source for the European gene pool. Am. J. Hum. Genet. 75,

910–918.

42. Parson, W., and Bandelt, H.J. (2007). Extended guidelines for

mtDNA typing of population data in forensic science. Forensic

Sci. Int. Genet. 1, 13–19.

43. Salas, A., Carracedo, A., Macaulay, V., Richards, M., and

Bandelt, H.J. (2005). A practical guide to mitochondrial DNA

error prevention in clinical, forensic, and population genetics.

Biochem. Biophys. Res. Commun. 335, 891–899.

44. Bandelt, H.J., Lahermo, P., Richards, M., and Macaulay, V.

(2001). Detecting errors in mtDNA data by phylogenetic

analysis. Int. J. Legal Med. 115, 64–69.

45. Ballantyne, K.N., van Oven, M., Ralf, A., Stoneking, M., Mitchell,

R.J., van Oorschot, R.A., and Kayser, M. (2011). MtDNA

SNP multiplexes for efficient inference of matrilineal genetic

ancestry within Oceania. Forensic Sci. Int. Genet., in press.


Published online September 20, 2011. 10.1016/j.fsigen.2011.08.010.

46. Pereira, L., Soares, P., Radivojac, P., Li, B., and Samuels, D.C.

(2011). Comparing phylogeny and the predicted pathogenicity

of protein variations reveals equal purifying selection across

the global human mtDNA diversity. Am. J. Hum. Genet. 88,

433–439.

47. Behar, D.M., Harmant, C., Manry, J., van Oven, M., Haak, W.,

Martinez-Cruz, B., Salaberria, J., Oyharcabal, B., Bauduer, F.,

Comas, D., and Quintana-Murci, L.; Genographic Consortium.

(2012). The Basque paradigm: Genetic evidence of a maternal

continuity in the Franco-Cantabrian Region since pre-

Neolithic times. Am. J. Hum. Genet. 90, 486–493.

48. Zeviani, M., and Carelli, V. (2007). Mitochondrial disorders.

Curr. Opin. Neurol. 20, 564–571.

49. Gunnarsdottir, E.D., Nandineni, M.R., Li, M., Myles, S., Gil,

D., Pakendorf, B., and Stoneking, M. (2011). Larger mitochondrial

DNA than Y-chromosome differences between matrilocal

and patrilocal groups from Sumatra. Nat. Commun. 2, 228.

50. Baum, D.A., Smith, S.D., and Donovan, S.S. (2005). Evolution.

The tree-thinking challenge. Science 310, 979–980.

51. Behar, D.M., Metspalu, E., Kivisild, T., Rosset, S., Tzur, S.,

Hadid, Y., Yudkovsky, G., Rosengarten, D., Pereira, L.,

Amorim, A., et al. (2008). Counting the founders: The matri-

lineal genetic ancestry of the Jewish Diaspora. PLoS ONE 3,

e2062.

52. Wallace, D.C., Singh, G., Lott, M.T., Hodge, J.A., Schurr, T.G.,

Lezza, A.M., Elsas, L.J., 2nd, and Nikoskelainen, E.K. (1988).

Mitochondrial DNA mutation associated with Leber’s heredi-

tary optic neuropathy. Science 242, 1427–1430.

53. MITOMAP. (2011) A Human Mitochondrial Genome Data-

base. http://www.mitomap.org.

54. Quintana-Murci, L., Harmant, C., Quach, H., Balanovsky, O.,

Zaporozhchenko, V., Bormans, C., van Helden, P.D., Hoal,

E.G., and Behar, D.M. (2010). Strong maternal Khoisan contribution

to the South African coloured population: A case of

gender-biased admixture. Am. J. Hum. Genet. 86, 611–620.

55. Schonberg, A., Theunert, C., Li, M., Stoneking, M., and

Nasidze, I. (2011). High-throughput sequencing of complete

human mtDNA genomes from the Caucasus and West Asia:

High diversity and demographic inferences. Eur. J. Hum.

Genet. 19, 988–994.


REPORT

Mutations in the Glycosylphosphatidylinositol Gene PIGL Cause CHIME Syndrome

Bobby G. Ng,1 Karl Hackmann,2 Melanie A. Jones,3 Alexey M. Eroshkin,1 Ping He,1 Roy Wiliams,1

Shruti Bhide,3 Vincent Cantagrel,4 Joseph G. Gleeson,4 Amy S. Paller,5 Rhonda E. Schnur,6

Sigrid Tinschert,2 Janice Zunich,7 Madhuri R. Hegde,3 and Hudson H. Freeze1,*

CHIME syndrome is characterized by colobomas, heart defects, ichthyosiform dermatosis, mental retardation (intellectual disability),

and ear anomalies, including conductive hearing loss. Whole-exome sequencing on five previously reported cases identified PIGL,

the de-N-acetylase required for glycosylphosphatidylinositol (GPI) anchor formation, as a strong candidate. Furthermore, cell lines

derived from these cases had significantly reduced levels of the two GPI anchor markers, CD59 and a GPI-binding toxin, aerolysin

(FLAER), confirming the pathogenicity of the mutations.

CHIME syndrome (MIM 280000), also known as Zunich

neuroectodermal syndrome, is an extremely rare auto-

somal recessive multisystemic disorder clinically character-

ized by colobomas, congenital heart defects, early onset

migratory ichthyosiform dermatosis, mental retardation

(intellectual disability), and ear anomalies, including

conductive hearing loss. Other clinical manifestations

include distinctive facial features, abnormal growth, geni-

tourinary abnormalities, seizures, and feeding difficul-

ties.1 To date, eight cases have been reported, all having

nearly identical phenotypes.

In 2010, Cantagrel et al.2 described a congenital disorder

of glycosylation (CDG) in which individuals with patho-

logical mutations in SRD5A3 (MIM 611715) presented

with CHIME-like features (MIM 612379). Yet individuals

with Classical CHIME as described by Zunich1 lacked

mutations in SRD5A3. We hypothesized that CHIME

syndrome could be a glycosylation disorder on the basis

of the clinical similarity to those individuals identified by

Cantagrel.2

To test this, we obtained DNA samples from six of the

eight previously described cases from five unrelated fami-

lies for the purpose of whole-exome sequencing (WES).

All clinical samples were obtained with proper informed

consent in accordance with the Sanford-Burnham Medical

Research Institute’s institutional review board consent

guidelines.

To identify the genetic cause of CHIME syndrome,

we performed WES on five of six previously described

cases.3,4 Exome sequences were enriched with the Roche

Nimblegen Seqcap EZ whole-exome Ver 2.0 on an Illumina

HiSeq platform, and the raw data were aligned to hg18.

Analysis employed Agilent’s AVADIS NGS software.

The five sequenced exomes had an average of 12,549

total variants and an average of 3,774 novel variants with

87% of the exome targets having at least 10× coverage

(Table 1).

A sixth case previously described by Tinschert et al.5 was

the only sample analyzed by comparative genomic hybridization

(CGH) array; it showed a 1 Mb maternally inherited

deletion on chromosome 17. DNA was isolated from

whole blood with QIAGEN’s DNA blood kit according to

the manufacturer’s protocol (QIAGEN, Hilden, Germany).

Array CGH was performed on Agilent’s SurePrint G3

Human CGH Microarray Kit 2x400K (Design ID 021850,

Agilent, Santa Clara, CA, USA) according to the manufac-

turer’s protocol, except that dyes were used inversely on

sample and reference. An Agilent microarray scanner

provided the raw data that were processed by Feature

Extraction 9.5. Deleted and amplified regions were identi-

fied on Agilent’s Genomic Workbench Standard Edition

5.0.14. Customized CGH array confirmed copy number

variants and familial segregation. Agilent’s eArray platform

had a general probe density of 1 per 200 bp to 1 per 2.5 kb

depending on the size of the variant. The coordinates of

the array result were mapped to hg18.

We focused exome analysis on the 17 genes in this

region (chromosome 17: 15,620,754–16,698,489 hg18) as

likely candidates. We excluded synonymous changes, vari-

ants in dbSNP v133, and variants present in a limited (30)

in-house exome library. All cases had compound heterozy-

gous mutations in only PIGL [RefSeq NM_004278.3]

within that region (Figure 1). Sanger sequencing confirmed

all mutations in PIGL, and carrier status in each available

parent or sibling (family 665 was not available) excluded

de novo events.
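The filtering logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the variant records, field names, gene labels, and the `candidate_genes` helper are all hypothetical, and only the hg18 interval is taken from the text.

```python
from collections import defaultdict

# Interval quoted in the text (chromosome 17, hg18 coordinates).
REGION = ("17", 15_620_754, 16_698_489)


def candidate_genes(variants, known_keys):
    """Return genes in REGION that keep at least two variants after filtering.

    variants   -- dicts with hypothetical fields: gene, chrom, pos, effect, key
    known_keys -- keys of variants already catalogued (for example in a public
                  SNP database or an in-house exome library)
    """
    chrom, start, end = REGION
    per_gene = defaultdict(list)
    for v in variants:
        in_region = v["chrom"] == chrom and start <= v["pos"] <= end
        if not in_region:
            continue
        if v["effect"] == "synonymous" or v["key"] in known_keys:
            continue
        per_gene[v["gene"]].append(v)
    return sorted(g for g, kept in per_gene.items() if len(kept) >= 2)


# Invented toy records; gene labels are placeholders.
toy = [
    {"gene": "GENE_A", "chrom": "17", "pos": 15_700_000, "effect": "missense", "key": "v1"},
    {"gene": "GENE_A", "chrom": "17", "pos": 15_700_500, "effect": "frameshift", "key": "v2"},
    {"gene": "GENE_B", "chrom": "17", "pos": 16_000_000, "effect": "synonymous", "key": "v3"},
    {"gene": "GENE_B", "chrom": "17", "pos": 16_000_100, "effect": "missense", "key": "v4"},
    {"gene": "GENE_C", "chrom": "17", "pos": 16_100_000, "effect": "missense", "key": "v5"},
    {"gene": "GENE_C", "chrom": "17", "pos": 16_100_200, "effect": "nonsense", "key": "v6"},
    {"gene": "GENE_D", "chrom": "2", "pos": 15_700_000, "effect": "missense", "key": "v7"},
    {"gene": "GENE_D", "chrom": "2", "pos": 15_700_100, "effect": "missense", "key": "v8"},
]
hits = candidate_genes(toy, known_keys={"v5"})
print(hits)
```

Two retained variants in one gene only suggest compound heterozygosity; confirming that they lie on different alleles requires parental or sibling genotypes, as was done here by Sanger sequencing.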

PIGL (NP_004269 [MIM 605947]) is an endoplasmic

reticulum (ER)-localized enzyme that catalyzes the second

1Genetic Disease Program, Sanford Children’s Health Research Center, Sanford-Burnham Medical Research Institute, La Jolla, CA 92037, USA; 2Institut fuer

Klinische Genetik, Medizinische Fakultaet Carl Gustav Carus, Technische Universitaet Dresden, 01307 Dresden, Germany; 3Department of Human

Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA; 4Neurogenetics Laboratory, Institute for Genomic Medicine, Howard Hughes

Medical Institute, Department of Neurosciences and Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA; 5Department of Dermatology,

Northwestern University, Feinberg School of Medicine, Chicago, IL 60611, USA; 6Division of Genetics, Department of Pediatrics, Cooper Medical School of

Rowan University, Camden, NJ 08103, USA; 7Genetics Center, Indiana University School of Medicine–Northwest, Gary, IN 46408, USA

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.010. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 685–688, April 6, 2012 685

step of glycosylphosphatidylinositol (GPI) biosynthesis,

the de-N-acetylation of N-acetylglucosaminyl-phosphatidylinositol

(GlcNAc-PI → GlcN-PI) that occurs on the

cytoplasmic side of the ER.6 Following de-N-acetylation,

glucosaminyl-phosphatidylinositol (GlcN-PI) flips to the

luminal side of the ER where GlcN-PI undergoes further

extensions prior to its transfer to acceptor proteins.7

Aside from a possible founder missense mutation, the

other mutations identified in our six CHIME cases are

predicted to be highly damaging (frameshift, nonsense,

essential splice site, and entire gene deletion) (Table 2).

The c.500T>C (p.Leu167Pro) mutation found in all six

cases is at a highly conserved residue (Table 3) located in

the catalytic domain and predicted by both PolyPhen

and SIFT to be damaging.

Utilizing two large public databases, we found the

heterozygous missense mutation c.500T>C in eight out

of nearly 13,000 alleles. In the National Heart, Lung, and

Blood Institute (NHLBI) Exome Sequencing Project, the

c.500T>C mutation appears at a frequency of 6:10,752

alleles (5,376 genomes; with all six heterozygotes being

of European origin). In the 1000 Genomes database, it

was present at 2:2,188 alleles (1,094 genomes). Given the

relatively rare frequency (<0.1%) and the fact that all six

CHIME cases are of European ancestry, we hypothesized

that c.500T>C was due to a founder mutation.
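The pooled frequency quoted above follows directly from the two reported counts. The short sketch below reproduces the arithmetic; the counts are those given in the text, while the function name `pooled_frequency` is an illustrative assumption.

```python
def pooled_frequency(counts):
    """Pool (carrier_alleles, total_alleles) pairs and return the totals
    together with the pooled allele frequency."""
    carriers = sum(c for c, _ in counts.values())
    alleles = sum(n for _, n in counts.values())
    return carriers, alleles, carriers / alleles


# Counts for c.500T>C as reported in the text.
reported = {
    "NHLBI Exome Sequencing Project": (6, 10_752),
    "1000 Genomes": (2, 2_188),
}
carriers, alleles, freq = pooled_frequency(reported)
print(f"{carriers} of {alleles} alleles, pooled frequency {freq:.4%}")
```

The pooled value is about 0.06%, and each database taken separately also stays below the 0.1% threshold mentioned in the text.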

We compared two microsatellite markers, 17xATT (trinucleotide

marker) and 19xGT (dinucleotide marker), on

chromosome 17: 15,620,754–16,698,489 (hg18) flanking

PIGL. The c.500T>C missense mutation was tightly linked

with the 17xATT in all CHIME cases with an ATT repeat

size of 22, whereas European controls (16) had a repeat

size of 17. Furthermore, all CHIME cases had a

19xGT repeat size of 13, whereas the European controls

(15) had a repeat size of 19. These results support linkage

disequilibrium between the c.500T>C allele and allele 22

of the 17xATT and allele 13 of the 19xGT marker (data

not shown).

To confirm that the mutations in PIGL are pathological,

we utilized a primary fibroblast cell line from individual

3988 and an Epstein-Barr virus (EBV)-transformed

lymphoblast cell line from individual 33300 to measure

two separate cell surface GPI-anchor-containing markers,

CD59 and a GPI-binding toxin, aerolysin (FLAER). In

agreement with other proven GPI deficiencies, cells

available from CHIME syndrome cases are also deficient

for both GPI anchor markers (Figure 2). It is important

to note that individual 33300 carries the chromo-

some 17 deletion and c.500T>C mutation, making her

hemizygous for the mutation and proving that it is

pathogenic.

Three inherited genetic disorders were previously identi-

fied in GPI biosynthetic genes: PIGM (MIM 610273), PIGN

(MIM 606097), and PIGV (MIM 610274).8–10 Somatic

mutations in PIGA (MIM 311770) cause paroxysmal

nocturnal hemoglobinuria (MIM 300818), a hematologic

disorder.11 Interestingly, epidermal-specific knockout

of mouse PigA recreates features of human Harlequin

Table 1. Summary of WES Statistics for Five CHIME Syndrome Cases

| Statistic | 680-2-2 | 680-2-3 | 682-2-1 | 665-3-1 | 3988 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Total number of sequenced reads | 53,341,240 | 54,687,243 | 47,134,146 | 117,769,811 | 121,450,190 | 78,876,526 |
| Total number of unmapped reads (%) | 1,257,149 (2.4%) | 765,214 (1.4%) | 738,306 (1.6%) | 934,246 (0.8%) | 988,578 (0.8%) | 936,699 (1.2%) |
| Total number of mapped reads | 52,084,091 | 53,922,029 | 46,395,840 | 116,835,565 | 120,461,612 | 77,939,827 |
| Percentage of targets with 10× coverage | 78 | 88 | 85 | 88 | 95 | 87 |
| Percentage of targets with 20× coverage | 43 | 65 | 59 | 79 | 89 | 67 |
| Total NS-SS-Indels | 10,022 | 12,211 | 11,921 | 15,614 | 12,978 | 12,549 |
| Novel NS-SS-Indels | 2,338 | 3,211 | 3,181 | 7,173 | 2,966 | 3,774 |

Figure 1. Organization of Human PIGL and Associated Mutations

Schematic representation showing both genomic and protein organization of human PIGL with corresponding mutations as well as functional protein domains including a 20 aa transmembrane domain (TMD) and the core de-N-acetylase domain.

686 The American Journal of Human Genetics 90, 685–688, April 6, 2012

ichthyosis.12 Mutations in PIGM, encoding the first man-

nosyltransferase, are associated with venous thrombosis

and seizures (MIM 610293).8 Mutations in PIGN, which

encodes the ethanolamine phosphate transferase, cause

multiple congenital anomalies-hypotonia-seizures

syndrome (MIM 614080).9 Deficiencies in PIGV, which

encodes the second mannosyltransferase in GPI forma-

tion, are associated with the hyperphosphatasia mental

retardation syndrome (MIM 239300).10

One clear difference between PIGL deficiency and the

other GPI deficiencies is the magnitude of GPI marker

decrease. Although we have consistently seen a significant

decrease of 2- to 4-fold in both cell lines, the decrease was

less dramatic than those seen in the other PIG defi-

ciencies. One explanation is that the p.Leu167Pro alter-

ation mildly affects the binding and de-N-acetylation of

the GlcNAc-PI, but GlcN-PI can still be flipped into the

luminal side of the ER. Furthermore, although we show

a link between the p.Leu167Pro alteration and CHIME

syndrome, we cannot exclude the possibility that addi-

tional mutations in PIGL could cause disorders other

than CHIME syndrome.

GPI anchor deficiencies cause remarkable clinical diversity,

but that is typical of disorders in other glycosylation pathways,

such as the 38 Congenital Disorders of Glycosylation or

the 6 α-dystroglycanopathies.13 Hypomorphic alleles predominate,

and the clinical impact often depends more on

the severity of the mutation than on the specific mutated

gene.14 In conclusion, we analyzed six previously

described CHIME syndrome cases by using a combination

of genetic and biochemical approaches, including CGH

array and WES. We show that mutations in PIGL impair

GPI biosynthesis and are the underlying cause of this

disorder.

Table 3. Evolutionary Conservation of Leu167

| Species | Ortholog of Human Leu167 |
| --- | --- |
| Homo sapiens | YAAVRA L HSEGK |
| Mus musculus | YKAVRA L HSGGK |
| Rattus norvegicus | YKAVRA L HSGGK |
| Danio rerio | YKTLSH L ASAGR |
| Saccharomyces cerevisiae | YAAVKK L VDDYA |
| Gallus gallus | YAAVRA L HSEGK |
| Drosophila melanogaster | YAAAS L CLANL |
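The conservation shown in Table 3 can be verified programmatically. The sketch below is illustrative only: the sequence windows are copied from the table, whereas the data structure and the function names (`conserved_residue`, `species_matching_human_flanks`) are hypothetical and do not come from the original study.

```python
# Windows from Table 3: (left flank, residue aligned to human Leu167, right flank).
ORTHOLOG_WINDOWS = {
    "Homo sapiens": ("YAAVRA", "L", "HSEGK"),
    "Mus musculus": ("YKAVRA", "L", "HSGGK"),
    "Rattus norvegicus": ("YKAVRA", "L", "HSGGK"),
    "Danio rerio": ("YKTLSH", "L", "ASAGR"),
    "Saccharomyces cerevisiae": ("YAAVKK", "L", "VDDYA"),
    "Gallus gallus": ("YAAVRA", "L", "HSEGK"),
    "Drosophila melanogaster": ("YAAAS", "L", "CLANL"),
}


def conserved_residue(windows):
    """Return the focal residue if it is identical in every species, else None."""
    residues = {focal for _, focal, _ in windows.values()}
    return residues.pop() if len(residues) == 1 else None


def species_matching_human_flanks(windows, reference="Homo sapiens"):
    """List species whose flanking residues are identical to the reference."""
    ref_left, _, ref_right = windows[reference]
    return [
        species
        for species, (left, _, right) in windows.items()
        if species != reference and (left, right) == (ref_left, ref_right)
    ]


print("Residue conserved at the Leu167 position:", conserved_residue(ORTHOLOG_WINDOWS))
print("Flanks identical to human:", species_matching_human_flanks(ORTHOLOG_WINDOWS))
```

Only the focal residue is compared across all species; the flanks are compared as whole strings because the Drosophila window in the table has a shorter left flank than the others.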

Table 2. Mutations Identified in Each of the Five CHIME Families

| Individual | DNA Level Mutations | Protein Level Alterations | Reference |
| --- | --- | --- | --- |
| 680-2-2 | c.274delC, c.500T>C | p.Leu92Phefs*15, p.Leu167Pro | Shashi et al.3 |
| 680-2-3 | c.274delC, c.500T>C | p.Leu92Phefs*15, p.Leu167Pro | Shashi et al.3 |
| 682-2-1 | c.500T>C, c.652C>T | p.Leu167Pro, p.Gln218* | Shashi et al.3 |
| 665-3-1 | c.500T>C | p.Leu167Pro^a | Schnur15 |
| 3988 | c.427-1G>A, c.500T>C | c.427-1G>A, p.Leu167Pro | Sidbury4 |
| 33300 | del17p12-p11.2, c.500T>C | del17p12-p11.2, p.Leu167Pro | Tinschert5 |

^a Second mutation not identified; parent DNA unavailable.

Figure 2. Cell Surface Expression of Total GPI Anchor and CD59

Fluorescence-activated cell sorting analysis for two separate GPI anchor markers, CD59 and FLAER, was used on a primary fibroblast line from individual 3988 and an EBV-transformed lymphoblast line from individual 33300 to evaluate GPI anchor levels. In both instances, two normal controls were used. Shown is a representation of the two. Dotted lines indicate isotype controls.

Acknowledgments

Supported by The Rocket Fund, a Sanford Professorship (H.H.F.)

and R01 DK55615. K.H. was supported by the Bundesministerium

fur Bildung und Forschung network grant MR-NET 01GS08166.

Received: January 5, 2012

Revised: January 30, 2012

Accepted: February 9, 2012

Published online: March 22, 2012

Web Resources

The URLs for data presented herein are as follows:

1000 Genomes, http://www.1000genomes.org

Agilent eARRAY, https://earray.chem.agilent.com/earray/

NHLBI Exome Sequencing Project (ESP), http://evs.gs.washington.edu/EVS/

Online Mendelian Inheritance in Man (OMIM), http://www.omim.org

UCSC Genome Browser, www.genome.ucsc.edu

References

1. Zunich, J., and Esterly, N. (2008). CHIME syndrome

(Zunich syndrome). In Neurocutaneous Disorders Phakomatoses

and Hamartoneoplastic Syndromes, M. Ruggieri, I. Pascual-

Castroviejo, C. Di Rocco, and G. Weinheim, eds. (Wien:

Springer-Verlag), pp. 949–955.

2. Cantagrel, V., Lefeber, D.J., Ng, B.G., Guan, Z., Silhavy, J.L.,

Bielas, S.L., Lehle, L., Hombauer, H., Adamowicz, M., Swiezew-

ska, E., et al. (2010). SRD5A3 is required for converting

polyprenol to dolichol and is mutated in a congenital glyco-

sylation disorder. Cell 142, 203–217.

3. Shashi, V., Zunich, J., Kelly, T.E., and Fryburg, J.S. (1995).

Neuroectodermal (CHIME) syndrome: an additional case

with long term follow up of all reported cases. J. Med. Genet.

32, 465–469.

4. Sidbury, R., and Paller, A.S. (2001). What syndrome is this?

CHIME syndrome. Pediatr. Dermatol. 18, 252–254.

5. Tinschert, S., Anton-Lamprecht, I., Albrecht-Nebe, H., and

Audring, H. (1996). Zunich neuroectodermal syndrome:

migratory ichthyosiform dermatosis, colobomas, and other

abnormalities. Pediatr. Dermatol. 13, 363–371.

6. Watanabe, R., Ohishi, K., Maeda, Y., Nakamura, N., and

Kinoshita, T. (1999). Mammalian PIG-L and its yeast homo-

logue Gpi12p are N-acetylglucosaminylphosphatidylinositol

de-N-acetylases essential in glycosylphosphatidylinositol

biosynthesis. Biochem. J. 339, 185–192.

7. Ferguson, M.A.J., Kinoshita, T., and Hart, G.W. (2009).

Glycosylphosphatidylinositol Anchors. In Essentials of Glyco-

biology, A. Varki, R.D. Cummings, J.D. Esko, H. Freeze, G.

Hart, and J. Marth, eds. (Cold Spring Harbor, NY: Cold Spring

Harbor Laboratory Press), pp. 143–161.

8. Almeida, A.M., Murakami, Y., Layton, D.M., Hillmen, P., Sell-

ick, G.S., Maeda, Y., Richards, S., Patterson, S., Kotsianidis, I.,

Mollica, L., et al. (2006). Hypomorphic promoter mutation

in PIGM causes inherited glycosylphosphatidylinositol defi-

ciency. Nat. Med. 12, 846–851.

9. Maydan, G., Noyman, I., Har-Zahav, A., Neriah, Z.B., Pasma-

nik-Chor, M., Yeheskel, A., Albin-Kaplanski, A., Maya, I.,

Magal, N., Birk, E., et al. (2011). Multiple congenital anoma-

lies-hypotonia-seizures syndrome is caused by a mutation in

PIGN. J. Med. Genet. 48, 383–389.

10. Krawitz, P.M., Schweiger, M.R., Rodelsperger, C., Marcelis, C.,

Kolsch, U., Meisel, C., Stephani, F., Kinoshita, T., Murakami,

Y., Bauer, S., et al. (2010). Identity-by-descent filtering of exome

sequence data identifies PIGV mutations in hyperphosphata-

sia mental retardation syndrome. Nat. Genet. 42, 827–829.

11. Takeda, J., Miyata, T., Kawagoe, K., Iida, Y., Endo, Y., Fujita, T.,

Takahashi, M., Kitani, T., and Kinoshita, T. (1993). Deficiency

of the GPI anchor caused by a somatic mutation of the PIG-

A gene in paroxysmal nocturnal hemoglobinuria. Cell 73,

703–711.

12. Hara-Chikuma, M., Takeda, J., Tarutani, M., Uchida, Y., Hol-

leran, W.M., Endo, Y., Elias, P.M., and Inoue, S. (2004).

Epidermal-specific defect of GPI anchor in Pig-a null mice

results in Harlequin ichthyosis-like features. J. Invest. Derma-

tol. 123, 464–469.

13. Freeze, H.H., and Ng, B.G. (2011). Golgi glycosylation and

human inherited diseases. In Cold Spring Harb Perspect

Biol, G. Warren and J. Rothman, eds. (Cold Spring Harbor,

NY: Cold Spring Harbor Laboratory Press), pp. 35–56.

14. Godfrey, C., Foley, A.R., Clement, E., and Muntoni, F. (2011).

Dystroglycanopathies: coming into focus. Curr. Opin. Genet.

Dev. 21, 278–285.

15. Schnur, R.E., Greenbaum, B.H., Heymann, W.R., Christensen,

K., Buck, A.S., and Reid, C.S. (1997). Acute lymphoblastic

leukemia in a child with the CHIME neuroectodermal

dysplasia syndrome. Am. J. Med. Genet. 72, 24–29.

REPORT

SKIV2L Mutations Cause Syndromic Diarrhea, or Trichohepatoenteric Syndrome

Alexandre Fabre,1,2 Bernard Charroux,3 Christine Martinez-Vinson,4 Bertrand Roquelaure,2

Egritas Odul,5 Ersin Sayar,6 Hilary Smith,7 Virginie Colomb,8 Nicolas Andre,9 Jean-Pierre Hugot,4

Olivier Goulet,8 Caroline Lacoste,10 Jacques Sarles,2 Julien Royet,3 Nicolas Levy,1,10

and Catherine Badens1,10,*

Syndromic diarrhea (or trichohepatoenteric syndrome) is a rare congenital bowel disorder characterized by intractable diarrhea and

woolly hair, and it has recently been associated with mutations in TTC37. Although databases report TTC37 as being the human ortho-

log of Ski3p, one of the yeast Ski-complex cofactors, this lead was not investigated in initial studies. The Ski complex is a multiprotein

complex required for exosome-mediated RNA surveillance, including the regulation of normal mRNA and the decay of nonfunctional

mRNA. Considering the fact that TTC37 is homologous to Ski3p, we explored a gene encoding another Ski-complex cofactor, SKIV2L, in

six individuals presenting with typical syndromic diarrhea without variation in TTC37. We identified mutations in all six individuals.

Our results show that mutations in genes encoding cofactors of the human Ski complex cause syndromic diarrhea, establishing a link

between defects of the human exosome complex and a Mendelian disease.

Syndromic diarrhea (SD) is a rare and severe disease charac-

terized by intractable diarrhea, facial dysmorphism, intra-

uterine growth retardation, immunodeficiency, and hair

abnormalities.1 Trichohepatoenteric (THE [MIM 222470])

syndrome, described initially by Verloes et al. in 19972 as

a different syndrome, has since been grouped with syn-

dromic diarrhea because the main clinical features in

both syndromes are identical3 (for simplicity, we now

refer to both syndromes as a singular disorder, SD/THE

syndrome). After a homozygosity-mapping analysis, we

and others recently identified TTC37 mutations as being

responsible for this syndrome in 21 individuals.4,5 The

precise function of TTC37 (also called Thespin) was not

elucidated even after the description of its involvement

in SD/THE. This protein shares no sequence homology

with other human proteins and shows no known func-

tional domains except several tetratrico-peptide-repeat

(TPR) domains that are structural motifs found in over

300 human proteins.6 In some databases, TTC37 is re-

ported as being the ortholog of yeast SKI3, which encodes

a key component of the Ski complex, a multiprotein

complex required for exosome-mediated RNA surveil-

lance.7 However, this lead was not explored in the initial

studies. In our series, 6 out of 15 individuals did not carry

a mutation in TTC37 but presented with typical character-

istics of SD/THE syndrome. Considering the fact that

TTC37 is homologous to Ski3p, we assumed that other

genes encoding Ski-complex proteins might be responsible

for SD/THE syndrome in these individuals; we confronted

this hypothesis with the results of the linkage analysis of

one of the consanguineous families.

Out of the six individuals with typical SD/THE syndrome

and no mutation in TTC37, only one was previously re-

ported.8 All procedures followed for clinical and genetic

analyses in this study were in accordance with the ethical

standards of the institutional and national committees on

human experimentation, and proper informed consent

was obtained from the parents of the affected children. All

individuals presented with severe and intractable diarrhea

that occurred between 1 and 12 weeks after birth, hair ab-

normalities (sparse, fragile, and uncombable hair and tri-

chorrhexis nodosa), and facial dysmorphism characterized

by hypertelorism, a broad flat nasal bridge, and a prominent

forehead (Figure 1). All children received parenteral nutri-

tion, but the amount of time varied between individuals

and ranged from a few weeks to several years. Immunodeficiency

was mostly due to low immunoglobulin levels and to

the absence of an immune response to vaccines. One indi-

vidual out of the three with immunodeficiency died from

a measles infection. These six individuals harbor no mutation

in TTC37, but their clinical presentation is indistinguishable

from that of individuals who have TTC37 mutations (Table 1).

Searching for a candidate gene in this group, we investi-

gated the possible homology (reported in the Ensembl

database) between TTC37 and yeast SKI3. Interspecies

sequence-alignment analysis (with the bioinformatics

prediction software BLAST) revealed that TTC37 shares

significant amino-acid-sequence similarity with yeast

1UMR_S 910, Inserm-Faculte de Medecine, Aix-Marseille Universite, 13385 Marseille, France; 2AP-HM, Service de Pediatrie Multidisciplinaire, Hopital d’En-

fants de la Timone, 13385 Marseille, France; 3Institut de Biologie du Developpement de Marseille-Luminy, Aix-Marseille Universite, 13288 Marseille,

France; 4AP-HP, Service de Gastroenterologie, Hopital Robert Debre, 75019 Paris, France; 5Department of Pediatric Gastroenterology, Gazi University School

of Medicine, 06500 Ankara, Turkey; 6Department of Pediatric Gastroenterology, Akdeniz University, 07058 Antalya, Turkey; 7Department of Pediatrics,

Children’s Memorial Hospital, Chicago, IL 60614, USA; 8AP-HP, Service de Gastroenterologie, Hopital Necker-Enfants Malades, 75004 Paris, France; 9AP-HM, Service d’Oncologie Pediatrique, Hopital d’Enfants de la Timone, 13385 Marseille, France; 10AP-HM, Laboratoire de Genetique Moleculaire,

Hopital d’Enfants de la Timone, 13385 Marseille, France

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.009. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 689–692, April 6, 2012 689

Ski3p. Between TTC37 residues 5 and 84, TTC37 and Ski3p

were 35% identical and 58% similar, and between residues

488 and 1223, they were 20% identical and 38% similar,

indicating that TTC37 is the human ortholog of SKI3.

Because Ski3p is part of a multiprotein complex, this finding

prompted us to test whether mutations in other human orthologs

of the Ski proteins could be associated with SD/THE

syndrome. For one consanguineous family, we performed

a linkage analysis in which the TTC37 critical interval

was excluded, and this analysis revealed that there was a

region of homozygosity spanning nucleotides 9,138,488–

40,453,008 and containing 856 genes in chromosomal

region 6p21.2–6p24.3. When looking for functional candi-

date genes in this region, we noticed that SKIV2L (RefSeq

accession number NM_006929.4), a gene described as being

the human ortholog of yeast SKI2,9,10 was present in region

6p21.3. Thus, we performed direct sequencing of SKIV2L

in our cohort of six individuals affected by typical SD/THE

syndrome and found the presence of mutations predicted

to be deleterious in all six. The identified mutations

and their mode of inheritance are described in Table 2.

All of the unaffected parents available for analysis were

heterozygous for one of the mutations identified in

their child. We identified eight different variations dis-

tributed throughout the protein (Figure 1). Seven of these

variations correspond to nonsense or frameshift mutations

that introduce a premature termination codon, and they are

Figure 1. Clinical Presentation of Individuals with SD/THE Syndrome and SKIV2L Mutations

(A) Clinical presentation of four individuals with SD/THE syndrome. One individual has mutations in TTC37 (first picture on the left), and three others have mutations in SKIV2L (right panel).

(B) A schematic overview of the predicted SKIV2L structure with a helicase ATP-binding domain and a helicase C-terminal domain (UniProt). The location of the domains is annotated by amino acid position. The eight amino acid substitutions are indicated by arrows.

(C) Nucleotide sequences are shown for the controls (upper sequence) and for each SKIV2L variant (lower sequence). Arrows indicate mutation positions on the reference sequences.

c.848G>A (p.Trp283*), c.1434del (p.Ser479Alafs*3),

c.1635_1636insA (p.Gly546Argfs*35), c.2266C>T (p.Arg756*),

c.2442G>A (p.Trp814*), c.2572del (p.Val858*), and

c.2662_2663del (p.Arg888Glyfs*12). The

mutation types suggest that the disease

mechanism is loss of function. We identified

one missense mutation, c.1022T>G

(p.Val341Gly), that is located in the helicase

ATP-binding domain (Figure 1) and affects

a highly conserved residue. PolyPhen

predicted that this specific mutation is probably

damaging (the prediction score was 2.8

[mutations predicted to be pathological have an index

above 0.5]).
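The DNA-level and protein-level names of the substitutions can be cross-checked against each other. In HGVS coding-sequence numbering, c.1 is the A of the ATG start codon, so coding nucleotide n lies in codon ceil(n/3). The sketch below applies this rule to the four single-nucleotide substitutions listed in Table 2. It is a consistency check written for illustration, not the authors' annotation pipeline, and the function names are hypothetical. The rule is not applied to deletions or insertions, for which the first altered amino acid can lie downstream of the edited codon.

```python
import re


def codon_number(cdna_position: int) -> int:
    """Codon containing a coding nucleotide (c.1 = A of ATG): ceil(n / 3)."""
    return (cdna_position + 2) // 3


def is_consistent(cdna_name: str, protein_name: str) -> bool:
    """Check that a substitution's c. position falls in the codon named by p."""
    c_pos = int(re.match(r"c\.(\d+)", cdna_name).group(1))
    p_pos = int(re.match(r"p\.[A-Za-z]{3}(\d+)", protein_name).group(1))
    return codon_number(c_pos) == p_pos


# Single-nucleotide substitutions in SKIV2L as listed in Table 2.
SUBSTITUTIONS = [
    ("c.848G>A", "p.Trp283*"),
    ("c.1022T>G", "p.Val341Gly"),
    ("c.2266C>T", "p.Arg756*"),
    ("c.2442G>A", "p.Trp814*"),
]

for cdna, protein in SUBSTITUTIONS:
    status = "consistent" if is_consistent(cdna, protein) else "INCONSISTENT"
    print(f"{cdna} / {protein}: {status}")
```

A mismatch between the two names would point to a transcription error in one of them rather than to a biological finding.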

It is now clear that the majority of genomic information is

transcribed into RNA molecules; this process generates very

abundant and complex pools of RNAs that the cells have to

control. While performing numerous RNA-processing reactions,

the cell must, at the same time, eradicate surplus and

aberrant material.11 This task is, at least partially, performed

by the RNA exosome multiprotein complex that contains

both 3′-to-5′ exonuclease and 3′-to-5′ endonuclease activity and acts in both the nucleus and the cytoplasm.12 Initially

discovered in yeast but also present in higher eukaryotes,

including humans, this exosome complex is involved in

the decay pathways of normal mRNA but is also required

to maintain the fidelity of gene expression through RNA

surveillance processes, such as nonsense-mediated decay,

nonstop decay, and no-go decay.13–16 Genetic screens in

yeast have identified three proteins (Ski2p, Ski3p, and

Ski8p) that form the Ski (Superkiller) complex and act as

specific cofactors required for all the functions of the cyto-

plasmic exosome but not the nuclear exosome.17 The puta-

tive DExH-box RNA helicase activity of Ski2 suggests that it

is the only Ski protein with a catalytic function, whereas

Ski3p and Ski8p contain repeated domains thought to be

needed for protein-protein interactions.18 In humans, the

cytoplasmic exosome is involved in various mRNA decay

pathways and is required for normal cell growth.19

Here, we show that in six families, molecular defects in

the cytoplasmic-exosome cofactor SKIV2L cause SD/THE

syndrome, a severe autosomal-recessive condition mainly

characterized by intractable diarrhea and woolly hair. This

study establishes a formal link between a constitutional

congenital disease and defects of human cytoplasmic-exo-

some cofactors. Although SD/THE syndrome is genetically

heterogeneous and is associated with at least two different

genes, it is extremely homogenous at the clinical level

(Table 1), suggesting that a defect in Ski-complex function

or structure is a key mechanism responsible for the main

clinical features. Consequently, WDR61, the human ortholog

of the third cofactor SKI8, will be a relevant candidate

gene to be tested in persons that are affected by THE

syndrome and that have no mutation in TTC37 or SKIV2L

sequences (a situation not encountered in our series).

In human pathology, the exosome complex has previously

been implicated in autoimmune diseases, in which

components of the nuclear or cytoplasmic exosome are

the target of an autoimmune response, or in cancer.20 Here,

we point out that exosome dysfunction must be considered

a cause of Mendelian disorders. The association between

mutations in exosome-cofactor-encoding genes and

human diseases provides a valuable model for investigating

the role of this structure in human pathology but also in

normal cellular function. The mechanism by which

mRNA-surveillance defects lead to various clinical symp-

toms, such as severe diarrhea, hair abnormalities, or immu-

nodeficiency, will need to be investigated in further studies.

Acknowledgments

We are extremely grateful to the individuals with trichohepatoen-

teric syndrome and the family members for their participation in

the study. We also want to acknowledge Sylvain Baulande and

Pascal Soularue (from Partnerchip), who kindly performed bioin-

formatics analysis after the linkage studies, Julies Salomon, who

provided DNA samples, and Laurent Villard for helpful discussion.

DNA extraction and storage were performed in the Biobank of the

Department of Genetics of La Timone Hospital. This work was

financially supported by the Assistance Publique-Hopitaux de

Marseille (AORC 2010). A.F. is supported by a scholarship from

the Fondation de l’Universite de la Mediterranee.

Received: December 10, 2011

Revised: January 23, 2012

Accepted: February 10, 2012

Published online: March 22, 2012

Web Resources

The URLs for the data presented herein are as follows:

BLAST, http://blast.ncbi.nlm.nih.gov/Blast.cgi

Ensembl, http://www.ensembl.org

Homozygositymapper, http://www.homozygositymapper.org

Online Mendelian Inheritance in Man (OMIM), http://www.omim.org

Polyphen, http://genetics.bwh.harvard.edu/pph/

UniProt, http://www.uniprot.org

Table 1. Clinical Data of Individuals Affected by SD/THE Syndrome

| Feature | Individuals with Mutations in TTC37 (n = 18) | Individuals with Mutations in SKIV2L (n = 6) |
| --- | --- | --- |
| Premature birth (<37 weeks) | 9/17 | 2/5 |
| Intrauterine growth restriction | 14/17 | 4/6 |
| Birth weight (median and mean) in kg | 1.84 (0.78–3.58); 1.868 | 1.6 (1.01–2.00); 1.47 |
| Intractable diarrhea | 18/18 | 6/6 |
| Onset of diarrhea (median and mean) in weeks | 3.5 (1–32); 7.75 | 2.5 (1–12); 3.8 |
| Villous atrophy | 16/18 | 3/5 |
| Colitis | 5/6 | 3/3 |
| Facial dysmorphism | 18/18 | 6/6 |
| Hair abnormalities | 18/18 | 6/6 |
| Trichorrhexis nodosa | 17/18 | 5/5 |
| Immune deficiency | 17/18 | 3/6 |
| Liver disease | 9/16 | 3/6 |
| Siderosis | 3/14 | 1/3 |
| Cirrhosis | 7/15 | 2/3 |
| Skin abnormalities | 7/16 | 3/4 |
| Platelet abnormalities | 7/17 | 0/2 |
| Cardiac abnormalities | 4/16 | 2/4 |
| Outcome (deceased/alive) | 4/14 | 2/4 |

Table 2. Mutations in SKIV2L^a, Geographical Origins, and Consanguinity in the Families of Individuals with SD/THE Syndrome

| Individual | Mutation 1 | Mutation 2 | Consanguinity | Geographical Origin |
| --- | --- | --- | --- | --- |
| 1 | c.1635_1636insA (p.Gly546Argfs*35) | c.1635_1636insA (p.Gly546Argfs*35) | yes | North Africa |
| 2 | c.2266C>T (p.Arg756*) | c.2442G>A (p.Trp814*) | no | France |
| 3 | c.848G>A (p.Trp283*) | c.1022T>G (p.Val341Gly) | no | France |
| 4 | c.2572del (p.Val858*) | c.2572del (p.Val858*) | yes | Turkey |
| 5 | c.2662_2663del (p.Arg888Glyfs*12) | c.2662_2663del (p.Arg888Glyfs*12) | yes | Turkey |
| 6 | c.1434del (p.Ser479Alafs*3) | c.1434del (p.Ser479Alafs*3) | yes | Turkey |

^a RefSeq accession numbers NP_008860.4 and NM_006929.4.

References

1. Girault, D., Goulet, O., Le Deist, F., Brousse, N., Colomb, V.,

Cesarini, J.P., de Potter, S., Canioni, D., Griscelli, C., Fischer,

A., et al. (1994). Intractable infant diarrhea associated with

phenotypic abnormalities and immunodeficiency. J. Pediatr.

125, 36–42.

2. Verloes, A., Lombet, J., Lambert, Y., Hubert, A.F., Deprez, M.,

Fridman, V., Gosseye, S., Rigo, J., and Sokal, E. (1997). Tri-

cho-hepato-enteric syndrome: Further delineation of a distinct

syndrome with neonatal hemochromatosis phenotype,

intractable diarrhea, and hair anomalies. Am. J. Med. Genet.

68, 391–395.

3. Fabre, A., Andre, N., Breton, A., Broue, P., Badens, C., and

Roquelaure, B. (2007). Intractable diarrhea with ‘‘phenotypic

anomalies’’ and tricho-hepato-enteric syndrome: Two names

for the same disorder. Am. J. Med. Genet. A. 143, 584–588.

4. Hartley, J.L., Zachos, N.C., Dawood, B., Donowitz, M., For-

man, J., Pollitt, R.J., Morgan, N.V., Tee, L., Gissen, P., Kahr,

W.H., et al. (2010). Mutations in TTC37 cause trichohepatoen-

teric syndrome (phenotypic diarrhea of infancy). Gastroenter-

ology 138, 2388–2398, 2398, e1–e2.

5. Fabre, A., Martinez-Vinson, C., Roquelaure, B., Missirian, C.,

Andre, N., Breton, A., Lachaux, A., Odul, E., Colomb, V.,

Lemale, J., et al. (2011). Novel mutations in TTC37 associated

with tricho-hepato-enteric syndrome. Hum. Mutat. 32,

277–281.

6. Blatch, G.L., and Lassle, M. (1999). The tetratricopeptide

repeat: A structural motif mediating protein-protein interac-

tions. Bioessays 21, 932–939.

7. Brown, J.T., Bai, X., and Johnson, A.W. (2000). The yeast

antiviral proteins Ski2p, Ski3p, and Ski8p exist as a complex

in vivo. RNA 6, 449–457.

8. Egritas, O., Dalgic, B., and Onder, M. (2009). Tricho-hepato-

enteric syndrome presenting with mild colitis. Eur. J. Pediatr.

168, 933–935.

9. Lee, S.-G., Lee, I., Park, S.H., Kang, C., and Song, K. (1995).

Identification and characterization of a human cDNA homol-

ogous to yeast SKI2. Genomics 25, 660–666.

10. Dangel, A.W., Shen, L., Mendoza, A.R., Wu, L.-C., and Yu, C.Y.

(1995). Human helicase gene SKI2W in the HLA class III

region exhibits striking structural similarities to the yeast anti-

viral gene SKI2 and to the human gene KIAA0052: Emergence

of a new gene family. Nucleic Acids Res. 23, 2120–2126.

11. Garneau, N.L., Wilusz, J., and Wilusz, C.J. (2007). The high-

ways and byways of mRNA decay. Nat. Rev. Mol. Cell Biol.

8, 113–126.

12. Houseley, J., and Tollervey, D. (2009). The many pathways of

RNA degradation. Cell 136, 763–776.

13. Schmid, M., and Jensen, T.H. (2008). The exosome: A multipurpose

RNA-decay machine. Trends Biochem. Sci. 33, 501–510.

14. Parker, R., and Song, H. (2004). The enzymes and control of eukaryotic

mRNA turnover. Nat. Struct. Mol. Biol. 11, 121–127.

15. Zhu, B., Mandal, S.S., Pham, A.D., Zheng, Y., Erdjument-Brom-

age, H., Batra, S.K., Tempst, P., and Reinberg, D. (2005). The

human PAF complex coordinates transcription with events

downstream of RNA synthesis. Genes Dev. 19, 1668–1673.

16. Toh-E, A., and Wickner, R.B. (1980). ‘‘Superkiller’’ mutations

suppress chromosomal mutations affecting double-stranded

RNA killer plasmid replication in Saccharomyces cerevisiae.

Proc. Natl. Acad. Sci. USA 77, 527–530.

17. Schaeffer, D., Clark, A., Klauer, A.A., Tsanova, B., and van

Hoof, A. (2010). Functions of the cytoplasmic exosome. Adv.

Exp. Med. Biol. 702, 79–90.

18. Wang, L., Lewis, M.S., and Johnson, A.W. (2005). Domain

interactions within the Ski2/3/8 complex and between the

Ski complex and Ski7p. RNA 11, 1291–1302.

19. van Dijk, E.L., Schilders, G., and Pruijn, G.J. (2007). Human cell

growth requires a functional cytoplasmic exosome, which is

involved in various mRNA decay pathways. RNA 13, 1027–1035.

20. Staals, R.H., and Pruijn, G.J. (2010). The human exosome and

disease. Adv. Exp. Med. Biol. 702, 132–142.

REPORT

Mutations in C5ORF42 Cause Joubert Syndrome in the French Canadian Population

Myriam Srour,1,11 Jeremy Schwartzentruber,2,11 Fadi F. Hamdan,1 Luis H. Ospina,3 Lysanne Patry,1

Damian Labuda,4 Christine Massicotte,4 Sylvia Dobrzeniecka,1 Jose-Mario Capo-Chichi,1

Simon Papillon-Cavanagh,4 Mark E. Samuels,4 Kym M. Boycott,5 Michael I. Shevell,6

Rachel Laframboise,7 Valerie Desilets,4 FORGE Canada Consortium,12 Bruno Maranda,8

Guy A. Rouleau,9 Jacek Majewski,10 and Jacques L. Michaud1,*

Joubert syndrome (JBTS) is an autosomal-recessive disorder characterized by a distinctive mid-hindbrain malformation, developmental

delay with hypotonia, ocular-motor apraxia, and breathing abnormalities. Although JBTS was first described more than 40 years ago in

French Canadian siblings, the causal mutations have not yet been identified in this family nor in most French Canadian individuals

subsequently described. We ascertained a cluster of 16 JBTS-affected individuals from 11 families living in the Lower St. Lawrence region.

SNP genotyping excluded the presence of a common homozygous mutation that would explain the clustering of these individuals.

Exome sequencing performed on 15 subjects showed that nine affected individuals from seven families (including the original JBTS

family) carried rare compound-heterozygous mutations in C5ORF42. Two missense variants (c.4006C>T [p.Arg1336Trp] and

c.4690G>A [p.Ala1564Thr]) and a splicing mutation (c.7400+1G>A), which causes exon skipping, were found in multiple subjects

that were not known to be related, whereas three other truncating mutations (c.6407del [p.Pro2136Hisfs*31], c.4804C>T

[p.Arg1602*], and c.7477C>T [p.Arg2493*]) were identified in single individuals. None of the unaffected first-degree relatives were

compound heterozygous for these mutations. Moreover, none of the six putative mutations were detected among 477 French Canadian

controls. Our data suggest that mutations in C5ORF42 explain a large portion of French Canadian individuals with JBTS.

Joubert syndrome (JBTS [MIM 213300]) is an autosomal-

recessive disorder characterized by the presence of hypo-

tonia, apnea or hyperpnea in infancy, oculomotor apraxia,

and variable developmental delay or intellectual impair-

ment (reviewed in Sattar et al.1). The diagnostic hallmark

of JBTS is the presence of a complex malformation of the

midbrain-hindbrain junction that comprises cerebellar

vermis hypoplasia or aplasia, deepened interpeduncular

fossa, and elongated superior cerebellar peduncles. This

malformation appears like a molar tooth on an axial brain

MRI (magnetic resonance imaging). In a subset of individ-

uals, JBTS also involves other organs and results in cystic

kidneys, retinopathy, or polydactyly. JBTS is a genetically

heterogeneous condition for which 15 genes have been

described to date.2–19 All of these genes appear to play a

role in the development and/or function of nonmotile

cilia. Although JBTS was first described in French Canadian

siblings more than 40 years ago by Marie Joubert and

colleagues, until now, the causal mutations have not yet

been identified in the original family nor in most French

Canadian subjects.20,21

There is a high prevalence of JBTS in the French Canadian

population living in the Lower St. Lawrence (‘‘Bas-du-

Fleuve’’ in French) region of the province of Quebec

(Figure 1). In total, we identified 16 living affected individ-

uals (from 11 unrelated families) who have at least one

grandparent originating from that region. Informed con-

sent was obtained from all individuals or their legal guard-

ians. This project was approved by our institutional ethics

committee. We were initially able to collect blood-derived

DNA from 15 of these individuals, including an affected

individual (II-1 in family 394; individual BD in Joubert

et al.20) from the original JBTS family described by Marie

Joubert and colleagues in 1969. There was a striking cluster

of seven families from the east end of the region (Matapedia

region); one family is from Mont-Joli (population of 6,568),

three families are from Amqui (population of 6,261), and

three other families are from Sayabec (population of

1,877). Individual II-1 from family 394 did not undergo

brain-imaging studies, but an MRI scan performed on her

brother (II-2) showed the molar-tooth sign (MTS) (Fig-

ure 2B).21 All the other affected individuals showed the

MTS and variable expression of the classical JBTS features.

The cohort included three families with two affected

siblings, and the parents were not affected in any family

(consistent with a recessive mode of transmission).

1Centre of Excellence in Neurosciences, Universite de Montreal and Sainte-Justine Hospital Research Center, Montreal H3T 1C5, Canada; 2McGill Univer-

sity and Genome Quebec Innovation Centre, Montreal H3A 1A4, Canada; 3Department of Ophthalmology, Sainte-Justine Hospital Research Center,

Montreal H3T 1C5, Canada; 4Sainte-Justine Hospital Research Center, Montreal H3T 1C5, Canada; 5Children’s Hospital of Eastern Ontario Research

Institute, Ottawa K1H 8L1, Canada; 6Division of Pediatric Neurology, Montreal Children’s Hospital-McGill University Health Center, Montreal H3H

1P3, Canada; 7Department of Medical Genetics, Centre Hospitalier Universitaire Laval, Quebec G1V 4G2, Canada; 8Division of Genetics, Centre Hospitalier

Universitaire de Sherbrooke, Sherbrooke J1H 5N4, Canada; 9Centre of Excellence in Neurosciences of Universite de Montreal, Centre Hospitalier de

l’Universite de Montreal Research Center and Department of Medicine, Montreal H2L 2W5, Canada; 10Department of Human Genetics, McGill University,

Montreal H3A 1A4, Canada

11These authors contributed equally to this work

12FORGE Steering Committee is listed in Acknowledgements

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.011. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 693–700, April 6, 2012 693

It was initially established that the population of the

Lower St. Lawrence region was a result of both the immi-

gration of a limited number of settlers (6,000 individuals)

from Quebec City and its surrounding areas in the late

17th century and beginning of the 18th century and a rapid

increase in settlers resulting from a high fertility rate.22

The establishment of settlers in the region followed a

west-to-east pattern, and settlers later migrated to regions

farther east. A small number of Acadians also contributed

to the early population of the Matapedia region.23 The

demographic growth of this population thus appears to

be characterized by a series of bottlenecks that might

have resulted in regional founder effects. We hypothesized

that a founder effect could underlie the clustering of

individuals with JBTS in the Lower St. Lawrence region,

raising the possibility that a common homozygous mutation

explains a large proportion of these cases. We performed

whole-genome SNP genotyping in all 15 individuals with

JBTS by using the Illumina Human 610 Genotyping

BeadChip panel, which interrogates 620,901 SNPs, and

we used PLINK24 to search for homozygosity regions containing

>30 consecutive SNPs and extending over >1 Mb.

We identified several overlapping regions of shared

homozygosity, but these regions were not found in more

than five families, were small (1 megabase or less), and

contained genes that are unlikely to play a role in cilia

development and/or function (Table S1, available online).

Altogether, the genotyping data suggest the presence of

allelic and/or genetic heterogeneity within our cohort.
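
The homozygosity search described above can be illustrated with a minimal sketch. It is not PLINK's algorithm (which uses a sliding window and tolerates some heterozygous or missing calls); it simply scans one chromosome for uninterrupted runs of homozygous calls and applies the two thresholds stated in the text (>30 consecutive SNPs, >1 Mb). The function name, the two-letter genotype encoding, and the example data are assumptions made for illustration.

```python
from typing import List, Tuple


def homozygous_runs(
    positions: List[int],
    genotypes: List[str],
    min_snps_exclusive: int = 30,
    min_length_bp_exclusive: int = 1_000_000,
) -> List[Tuple[int, int, int]]:
    """Return (start_bp, end_bp, n_snps) for strict runs of homozygous SNPs.

    positions and genotypes describe one chromosome, sorted by position.
    Genotypes are two-letter calls such as "AA" or "AG" (hypothetical
    encoding); any heterozygous or missing call ends the current run.
    """
    runs = []
    run_start = None  # index of the first SNP in the current run
    # A trailing sentinel call closes a run that reaches the last SNP.
    for i, call in enumerate(genotypes + ["--"]):
        is_hom = len(call) == 2 and call[0] == call[1] and call[0] in "ACGT"
        if is_hom:
            if run_start is None:
                run_start = i
            continue
        if run_start is not None:
            n_snps = i - run_start
            length_bp = positions[i - 1] - positions[run_start]
            if n_snps > min_snps_exclusive and length_bp > min_length_bp_exclusive:
                runs.append((positions[run_start], positions[i - 1], n_snps))
            run_start = None
    return runs


# Illustrative data only: 40 SNPs spaced 50 kb apart.
example_positions = list(range(0, 40 * 50_000, 50_000))
example_genotypes = ["AA"] * 35 + ["AG"] + ["CC"] * 4
example_runs = homozygous_runs(example_positions, example_genotypes)
```

In this toy input, only the first stretch of 35 homozygous SNPs spanning 1.7 Mb satisfies both thresholds; the short run after the heterozygous call is discarded.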

Given the lack of hints from genotype-based mapping,

we decided to sequence the protein-coding exomes of all

our JBTS-affected subjects in the hopes of identifying a

unique candidate gene harboring private pathogenic vari-

ants in a large fraction of the samples. Genomic DNA from

each sample was captured with the Agilent SureSelect

Figure 1. Distribution of Individuals with JBTS in the Lower St. Lawrence Region

Numbers refer to families (pedigrees in Figure 2). Note the cluster of families along Route 132, which follows the Matapedia River.

50 Mb oligonucleotide library, and the

captured DNA was sequenced with paired-

end 100 bp reads on Illumina HiSeq2000.

The result was an average of 14.7 Gb of

raw sequence for each sample. Data were

analyzed as previously described.25 After

we used Picard (v. 1.48) to remove putative

PCR-generated duplicate reads, we aligned

the reads to human genome assembly

hg19 by using a Burrows-Wheeler algorithm

(BWA v. 0.5.9). The median read

depth of the bases in CCDS (consensus

coding sequence) exons was 115 (deter-

mined with Broad Institute Genome

Analysis Toolkit v. 1.0.4418).26 On average,

88% (±2.9%) of the bases in CCDS exons were covered

by at least 20 reads. We called sequence variants by

using custom scripts for Samtools (v. 0.1.17), Pileup, and

varFilter, and we required at least three variant reads as

well as >20% variant reads for each called position.

Single-nucleotide variants (SNVs) had Phred-like quality

scores of at least 20, and small insertions or deletions

(indels) had scores of at least 50. We used Annovar to

annotate variants according to the type of mutation,

occurrence in dbSNP, SIFT score, and 1,000 Genomes allele

frequency.27 To identify potentially pathogenic variants,

we filtered out (1) synonymous variants or intronic vari-

ants other than those affecting the consensus splice sites,

(2) variants seen in more than one of 261 exomes from

individuals with rare, monogenic diseases unrelated to

JBTS (these individuals were sequenced at the McGill

University and Genome Quebec Innovation Centre), and

(3) variants with a frequency greater than 0.5% in the

1,000 Genomes Browser (Tables 1 and 2).
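
The calling thresholds and the three exclusion filters described above can be summarized in a short sketch. This is not the authors' pipeline (which relied on Samtools, custom scripts, and Annovar); the `Variant` record, its field names, and the example variants are hypothetical, and only the numeric cutoffs are taken from the text.

```python
from dataclasses import dataclass

# Variant classes retained by filter (1); the labels are illustrative.
KEPT_CLASSES = {"nonsynonymous", "splicing", "coding_indel"}


@dataclass
class Variant:
    gene: str
    var_class: str        # e.g. "nonsynonymous", "synonymous", "splicing"
    is_indel: bool
    quality: float        # Phred-like quality score
    variant_reads: int
    total_reads: int
    inhouse_exomes: int   # in-house control exomes carrying the variant
    kg_frequency: float   # 1000 Genomes allele frequency (fraction)


def passes_calling(v: Variant) -> bool:
    """At least 3 variant reads, >20% variant reads, quality >=20 (SNV) or >=50 (indel)."""
    min_quality = 50 if v.is_indel else 20
    return (
        v.variant_reads >= 3
        and v.total_reads > 0
        and v.variant_reads / v.total_reads > 0.20
        and v.quality >= min_quality
    )


def passes_filters(v: Variant) -> bool:
    """Keep protein-altering variants seen in at most one in-house exome and at <=0.5% frequency."""
    return (
        v.var_class in KEPT_CLASSES
        and v.inhouse_exomes <= 1
        and v.kg_frequency <= 0.005
    )


def prioritize(variants):
    return [v for v in variants if passes_calling(v) and passes_filters(v)]


# Hypothetical variants, one per rejection reason plus one that is retained.
candidates = [
    Variant("GENE_A", "nonsynonymous", False, 60, 12, 30, 0, 0.0),
    Variant("GENE_B", "synonymous", False, 60, 12, 30, 0, 0.0),
    Variant("GENE_C", "nonsynonymous", False, 60, 2, 30, 0, 0.0),
    Variant("GENE_D", "coding_indel", True, 40, 10, 30, 0, 0.0),
    Variant("GENE_E", "nonsynonymous", False, 60, 12, 30, 5, 0.0),
    Variant("GENE_F", "splicing", False, 60, 12, 30, 1, 0.01),
]
retained_genes = [v.gene for v in prioritize(candidates)]
```

Only `GENE_A` survives here; each of the other records fails exactly one criterion, which makes the sketch easy to adapt when checking a single threshold.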

We first examined the exome datasets to look for rare

variants in the 15 genes already associated with JBTS (these

genes are INPP5E [MIM 613037], TMEM216 [MIM 613277],

AHI1 [MIM 608894], NPHP1 [MIM 607100], CEP290

[MIM 610142], TMEM67 [MIM 609884], RPGRIP1L

[MIM 610937], ARL13B [MIM 608922], CC2D2A [MIM

612013], CXORF5 [MIM 300170], KIF7 [MIM 611254],

TCTN1 [MIM 609863], TCTN2 [MIM 613885], TMEM237

[MIM 614424], and CEP41)2–19 as well as in the JBTS

candidate gene, TTC21B (MIM 612014).28 Two individuals

(II-1 from family 484 and II-2 from family 473, Figure S2)

that are not known to be related were each found to be

carrying two heterozygous missense variants (c.4667A>T

[p.Asp1556Val] and c.3376G>A [p.Glu1126Lys]) in

CC2D2A (RefSeq accession number NM_001080522.2).13,14

These amino acids are highly conserved, and both


mutations are predicted to be deleterious according to

SIFT (scores < 0.05)29 and Polyphen-2 (scores > 0.90)30

(Figure S1). The c.4667A>T (p.Asp1556Val) mutation has

already been reported in individuals with JBTS.31 Segrega-

tion studies have indicated that the affected individuals

but none of their unaffected first-degree relatives were

compound heterozygous for these mutations (Figure S2).

We conclude that these mutations are probably patho-

genic. Both individuals have a mild phenotype. They

have oculomotor apraxia and only mild motor delay

(they walked at 18 [II-1 from family 484] and 19 [II-2 from

family 473] months of age and do not have gait ataxia).

The individual who is of school age performs well in

a regular classroom. Four additional individuals were sin-

gly heterozygous for rare variants in the other known

Figure 2. Segregation of C5ORF42 Mutations in Families Affected by JBTS

Pedigree panels: (A) family 406/301, (B) family 394, (C) family 474, (D) family 480, (E) family 479, (F) family 489, and (G) family 468. The genotype of each tested individual is shown as wild-type (WT) or as the C5ORF42 variant carried on each allele (c.4006C>T, c.7400+1G>A, c.6407del, c.4804C>T, c.7477C>T, or c.4690G>A).

JBTS-associated genes; such variants

are c.265C>T (p.Leu89Phe) in

TMEM216 (NM_001173991.2),

c.3257A>G (p.Glu1086Gly) in

AHI1 (NM_001134831), c.1600G>A

(p.Glu534Lys) in CEP290 (NM_

025114.3), and c.3032T>C

(p.Met1011Thr) in TTC21B (NM_

024753.4). Because each of these

genes has previously been associated

with recessive JBTS, these heterozy-

gous variants are unlikely to fully

explain the disorder that these indi-

viduals have.

We next looked at the whole-exome

data for the other protein-coding

genes containing homozygous or

multiple heterozygous variants in the

13 affected individuals who did not

have mutations in CC2D2A (Tables 1

and 2). Strikingly, five subjects,

including a member of the initial

JBTS family, carried two different

heterozygous variants in an un-

studied anonymous gene, C5ORF42

(NM_023073.3). Mutations in six

other genes were found in affected

individuals among sets of three

families (Tables 1 and 2). Because these

latter genes (MUC5B, PLEC, FAT3, FLG,

TTN, and LAMA5) are known to accu-

mulate mutations at a high rate, they

are unlikely to be linked to the disease

(Table S2). All five affected individuals

with changes in C5ORF42 carried the

same missense mutation, c.4006C>T

(p.Arg1336Trp) (NM_023073.3), as

well as one of three different mutations: one mutation

that affects a consensus donor splice site, c.7400+1G>A

(NM_023073.3), and two truncating mutations, c.6407del

(p.Pro2136Hisfs*31) and c.4804C>T (p.Arg1602*) (NM_

023073.3) (Figures 2 and 3 and Table 3). Sanger sequencing

in the five affected individuals confirmed the presence of

these variants. Segregation studies indicated that the

affected individuals, but not their unaffected first-degree

relatives, were compound heterozygotes for these variants

(Figure 2 and Table 3). Subsequently, we were able to collect

DNA from individual II-2 (individual M.D.19-20), the

affected brother of II-1 in the initial JBTS family (family

394), and we found that he was compound heterozygous

for the same C5ORF42 mutations identified in his affected

sister (Figure 2B and Table 3). None of these four variants


was detected in 261 in-house control exomes, which were

derived from other projects, including some French Canadian

subjects, or in the 1,000 Genomes Browser. RT-PCR

performed on RNA extracted from the blood of individuals

II-2 (from family 394) and III-4 (from family 406/301), who

both carry the c.7400+1G>A splicing mutation, showed

that this mutation causes skipping of exon 35 in C5ORF42

(NM_023073.3) and results in the creation of a premature

stop codon (Figure S3). The p.Arg1336Trp amino acid

substitution is predicted to be damaging (SIFT = 0.00;

Polyphen-2 = 0.99) and to affect a residue that is conserved

across vertebrate species (Figure 3B).
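
The segregation logic used above (affected individuals are compound heterozygotes, unaffected first-degree relatives are not) can be sketched as a simple set comparison. The function name and the set-based genotype representation are assumptions; the example uses the c.4006C>T and c.6407del variants named in the text, but the assignment of each variant to a particular parent is hypothetical.

```python
def is_compound_het(child: set, parent_1: set, parent_2: set) -> bool:
    """True if the child carries exactly two heterozygous variants, one from each parent.

    Each argument is the set of heterozygous variant names carried by that
    person in the gene of interest.
    """
    if len(child) != 2:
        return False
    first, second = sorted(child)
    inherited_in_trans = (
        (first in parent_1 and second in parent_2)
        or (first in parent_2 and second in parent_1)
    )
    # If one parent carries both variants, they could lie on the same
    # chromosome (in cis), so this simple check does not call the child
    # a compound heterozygote.
    no_parent_has_both = not (child <= parent_1 or child <= parent_2)
    return inherited_in_trans and no_parent_has_both


# Hypothetical trio plus an unaffected sibling.
affected = {"c.4006C>T", "c.6407del"}
carrier_parent_1 = {"c.6407del"}
carrier_parent_2 = {"c.4006C>T"}
unaffected_sibling = set()

affected_is_compound_het = is_compound_het(affected, carrier_parent_1, carrier_parent_2)
sibling_is_compound_het = is_compound_het(unaffected_sibling, carrier_parent_1, carrier_parent_2)
```

A check of this kind only establishes that the two variants are inherited from different parents; it says nothing about pathogenicity, which the study assessed separately through conservation, prediction tools, and control frequencies.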

On the basis of the exome-sequencing data, four

additional JBTS-affected individuals from three families

(301, 468, and 489) were each carrying a single heterozy-

gous C5ORF42 mutation, including the already described

c.4006C>T (p.Arg1336Trp) and c.7400+1G>A mutations

and the truncating mutation c.7477C>T (p.Arg2493*)

(NM_023073.3) (Figure 2 and Table 3). The c.7477C>T

(p.Arg2493*) mutation was absent from our 261 control

exomes and the 1,000 Genomes Browser. Our SNP geno-

typing data suggest that these four individuals—but not

the other individuals with JBTS in our cohort—are hetero-

zygous for a unique 5 Mb haplotype that encompasses

C5ORF42 (Figure S4). It seemed unlikely that this haplo-

type would be carrying three different rare mutations;

therefore, this observation suggests that the four individ-

uals might carry a second mutation linked to this haplo-

type. Upon further inspection of the exome data, we

discovered that all four individuals are also heterozygous

for another missense variant, c.4690G>A (p.Ala1564Thr)

(based on the ENST00000388739 transcript annotated by

the Ensembl Genome Browser). This allele was not

included in our original filtered dataset because it is located

in an internal coding exon (chr5: 37,157,522–37,157,415)

not annotated by RefSeq for the longest isoform of the

gene (NM_023073.3). Sanger sequencing confirmed the

presence of the various mutations in the four affected

individuals. Segregation studies showed that the four

affected individuals but none of their unaffected first-

degree relatives were compound heterozygous for

c.4690G>A (p.Ala1564Thr) and for one of the three other

mutations (c.4006C>T [p.Arg1336Trp], c.7400+1G>A,

and c.7477C>T [p.Arg2493*]) (Figure 2). The additional,

alternative exon (which we designate exon 40a) with the

c.4690G>A (p.Ala1564Thr) mutation occurs between

RefSeq annotated exons 40 and 41 (NM_023073.3), is

present in brain expressed sequence tag (EST) clones with

GenBank accession numbers AK096581 and BC144070,

and retains the large open reading frame of the gene. Using

RNA-sequencing data made publicly available by Illumi-

na’s Body Map 2.0 (see Web Resources), we were able to

confirm the expression of the exon. The assembly of raw

data from 16 different tissues identified a large number of

reads that mapped to that exon in both brain and testes

samples; significantly fewer reads mapped to other tissues

(Figure S5). Reads that covered both ends of the exon and

spliced correctly to neighboring exons were found in either

brain or testes samples. The c.4690G>A (p.Ala1564Thr)

mutation was also absent from our 261 control exomes

and from the 1,000 Genomes Browser. It was not possible

to get accurate SIFT or Polyphen-2 predictions for this

mutation because the corresponding exon was not anno-

tated across species.

We further addressed the frequency of the six putative

C5ORF42 mutations identified in our JBTS individuals in

the French Canadian population. Genotyping 477 French

Canadian controls, including 96 Acadian subjects and 96

subjects from the Gaspesie region located immediately east

of Matapedia, did not identify a carrier of any of the six

C5ORF42 mutations. However, some of these mutations

are reported in the heterozygous state at very low frequen-

cies in the National Heart, Lung, and Blood Institute

(NHLBI) GO Exome Sequencing Project (ESP) dataset; these

mutations are c.4006C>T (p.Arg1336Trp) (2/10,754;

minor allele frequency [MAF] = 0.0186%), c.7477C>T

(p.Arg2493*) (1/10,755; MAF = 0.009%; rs139675596),

and c.4690G>A (p.Ala1564Thr) (12/4,574; MAF = 0.262%; rs111294855). It should be noted that

c.4006C>T and c.7477C>T correspond to CpG sites,

Table 2. Genes with Rare Homozygous or Multiple Heterozygous Variants from the Combined Exome Sequences from 13 Individuals with JBTS

Number of Families with Mutations in the Same Gene | Number of Genes | Gene Identity
1 family | 528 | C5ORF42, .
2 families | 16 | C5ORF42, ACAN, ADAMTS18, C10orf68, FSIP2, LRP1B, MUC12, MUC16, MUC4, MYO16, PKD1L2, PKHD1L1, RGPD4, SHROOM4, TMEM231, ZNF717
3 families | 7 | C5ORF42, MUC5B, PLEC, FAT3, FLG, TTN, LAMA5
4 families | 1 | C5ORF42
5 families | 1 | C5ORF42
>5 families | 0 | -

Table 1. Variant Prioritization Steps in the Analysis of Combined Exome Sequences from 13 Individuals with JBTS

Filters Applied (Sequentially) | Number of Variants Retained
Nonsynonymous, splicing, and coding indel variants | 34,157a
After excluding variants present in >1 in-house exome | 7,075
After excluding variants reported in 1,000 Genomes Browser (frequency > 0.5%) | 6,911

aTotal number of variants identified in the combined 13 exomes; redundant variants were counted only once.


which are associated with a higher mutation rate, possibly

explaining the recurrence of these nonetheless rare muta-

tions in different populations.
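
As a quick arithmetic check, the minor allele frequencies quoted above follow directly from the reported ESP allele counts (variant alleles divided by total alleles genotyped). The helper name and the dictionary layout below are illustrative assumptions; the counts are those given in the text.

```python
def maf_percent(variant_alleles: int, total_alleles: int) -> float:
    """Minor allele frequency expressed as a percentage."""
    return 100.0 * variant_alleles / total_alleles


# (variant alleles, total alleles) as reported for the ESP dataset.
esp_counts = {
    "c.4006C>T (p.Arg1336Trp)": (2, 10754),
    "c.7477C>T (p.Arg2493*)": (1, 10755),
    "c.4690G>A (p.Ala1564Thr)": (12, 4574),
}

esp_maf = {name: maf_percent(alt, total) for name, (alt, total) in esp_counts.items()}

for name, value in esp_maf.items():
    print(f"{name}: MAF = {value:.4f}%")
```

The c.4690G>A frequency is roughly 14 times that of c.4006C>T, which is one of the reasons the text treats its pathogenicity as uncertain.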

The presence of five potentially deleterious C5ORF42

mutations that segregate with the disease in seven presum-

ably unrelated (though all French Canadian) families

strongly suggests that disruption of this gene causes JBTS

in our subjects. It remains uncertain whether c.4690G>A

(p.Ala1564Thr) is pathogenic, considering that it is not

clearly deleterious and that it is found at a higher

frequency (0.26%) in the ESP dataset than are the other

mutations. It is possible that this variant is linked to

another mutation—not identified by our exome-

sequencing approach—on the same haplotype.

Very little is known about C5ORF42 function. The Ref-

Seq version of the full-length transcript (NM_023073.3;

Ensembl accession number ENST00000425232) apparently

derives from virtual assembly of overlapping mRNA

and EST clones. The predicted major mRNA isoform

comprises 11,199 bp and contains 52 exons; the putative

encoded protein is similarly large and comprises 3,198

amino acids. With the exception of c.4690G>A

(p.Ala1564Thr), all mutations reported herein are common

to all annotated protein-coding transcripts (Figure 3A). The

predicted protein sequence is well conserved across much

of the gene length in other vertebrates. It does not appear

to contain any specific known functional domains,

although the Gene Ontology project suggests that it might

be a transmembrane protein and ProtoNet predicts a

coiled-coil structure within the protein. Proteomic studies

have reported interactions among C5ORF42, the p21-

activating kinase 1 (PAK1), and the small ubiquitin-like

modifier 1 (SUMO1).32,33 Although the significance of

these interactions remains to be validated and further

investigated, it is noteworthy that these latter genes play

a role in neural development.34,35 EST-expression (UniGene

data), microarray profiling (Allen Brain Atlas), and

BioGPS indicate that C5ORF42 is widely expressed in

a variety of tissues, including the brain.

In terms of genotype-phenotype correlation, all JBTS

individuals with mutations in C5ORF42 showed global

developmental delay, and the onset of independent

Figure 3. C5ORF42 Mutations Identified in Individuals with JBTS

(A) Scheme showing the positions of the mutations with respect to the different C5ORF42 Ensembl-annotated transcripts that are predicted to produce proteins. The numbering on top is based on the cDNA positions of ENST00000425232 (identical to RefSeq accession number NM_023073.3). Mutation c.7957+288G>A is annotated as part of a coding exon in ENST00000388739 and causes a missense change (p.Ala1564Thr).

(B) NCBI HomoloGene-generated amino acid alignment of C5ORF42. Its predicted orthologs show the conservation of the Arg1336 residue.


walking ranged between 30 months and 8 years of age

(Table 3). Cognitive impairment was present in all individ-

uals but was variable, ranging from borderline intelligence

to mild intellectual disability. The majority of individuals

also showed oculomotor apraxia and breathing abnormal-

ities mainly characterized by episodes of hyperventilation.

Two individuals showed limb abnormalities; one had

preaxial and postaxial polydactyly, and another had

syndactyly of the third and fourth finger on one hand.

There was no evidence of retinal or kidney involvement.

There was no clear correlation between the type of

C5ORF42 mutation and the associated phenotype.

Surprisingly, we found that three mutations (c.4006C>T

[p.Arg1336Trp], c.7400+1G>A, and c.4690G>A

[p.Ala1564Thr]) in C5ORF42 were present in multiple

individuals in our cohort. Haplotype studies indicate that

each of these mutations is linked to a distinct haplotype

in these families despite the lack of documented genealog-

ical relationships among them (Figure S4). The higher

frequency of these mutations in the population of the

Lower St. Lawrence region could be explained by a founder

effect with the coincidental occurrence of the three muta-

tions in the same group of settlers or by multiple regional

founder effects corresponding to sequential pioneer fronts.

Although founder effects are typically associated with an

increase in the frequency of a specific allele,33 which is

often accompanied by other alleles that remain at their

usual background frequency, they can also involve

multiple common mutations.36,37

In summary, after the initial description of JBTS in a

French Canadian family 40 years ago, we have shown

that mutations in C5ORF42 explain this neurodevelop-

mental disorder in many affected individuals from the

French Canadian population. We have also found that

C5ORF42 is associated with a complex founder effect in

this population. Although the function of C5ORF42

remains unknown, future studies will likely elucidate its

role in cilia development and/or function.

Supplemental Data

Supplemental Data include five figures and two tables and can be

found with this article online at http://www.cell.com/AJHG.

Acknowledgments

Foremost, we thank the families who generously contributed their

time and materials to this research study. This work was selected

for study by the FORGE Canada Steering Committee, consisting

of K. Boycott (University of Ottawa), J. Friedman (University of

British Columbia), J. Michaud (Universite de Montreal), F. Bernier

(University of Calgary), M. Brudno (University of Toronto), B. Fernandez

(Memorial University), B. Knoppers (McGill University),

Table 3. Clinical Description of JBTS Individuals with C5ORF42 Mutations

Family | 406/301 | 406/301 | 406/301 | 394 | 394 | 474 | 480 | 489 | 479 | 468
Individual | IV-1 | IV-2 | IV-3 | II-1 | II-2 | II-1 | II-1 | II-1 | II-1 | II-1
Genotype
c.4006C>T (p.Arg1336Trp) | + | - | - | + | + | + | + | - | + | +
c.7400+1G>A | + | + | + | + | + | - | + | - | - | -
c.6407del (p.Pro2136Hisfs*31) | - | - | - | - | - | + | - | - | - | -
c.7477C>T (p.Arg2493*) | - | - | - | - | - | - | - | + | - | -
c.4804C>T (p.Arg1602*) | - | - | - | - | - | - | - | - | + | -
c.7957+288G>A (c.4690G>A [p.Ala1564Thr]) | - | + | + | - | - | - | - | + | - | +
Clinical features
Age (years) | 8 | 1.5 | 3 | 52 | 45 | 4 | 10 | 7 | 13 | 31
Sex | F | M | F | F | M | F | M | M | F | F
Developmental delay | + | + | + | + | + | + | + | + | + | +
Oculomotor apraxia | - | + | + | + | + | + | + | + | + | +
Breathing abnormality | + | + | + | + | + | + | + | + | - | -
Limb abnormalitya | - | + | - | - | - | + | - | - | - | -
Brain MRI | MTS | MTS | MTS | ND | MTS | MTS | MTS | MTS | MTS | MTS
Retinal involvementb | - (f) | - (e) | - (e) | - (h) | - (h) | - (f) | - (e) | - (e) | - (f) | - (h)
Renal involvementc | - (us) | - (us) | - (us) | - (h) | - (h) | - (us) | - (us) | - (us) | - (us) | - (h)

The nucleotide and amino acid positions are based on reference sequence NM_023073.3 except for c.4690G>A (p.Ala1564Thr), which is based on Ensembl transcript ENST00000509849. The following abbreviations are used: F, female; M, male; MRI, magnetic resonance imaging; MTS, molar tooth sign; ND, not done; f, fundoscopy; e, electroretinogram; h, history; and us, ultrasound.
aIndividual IV-2 from family 406/301 has a 3/4 syndactyly in the left hand and individual II-1 from family 474 has preaxial and postaxial polydactyly of the four limbs. Individual II-1 from family 394 did not undergo an MRI, but the MRI of her brother (individual II-2 from family 394) documented a MTS.
bLack of retinal involvement was determined by electroretinogram, fundoscopy, or history.
cLack of renal involvement was determined by renal ultrasound or history.


M. Samuels (Universite de Montreal), and S. Scherer (University of

Toronto). We would like to thank Janet Marcadier (clinical

coordinator) and Chandree Beaulieu (project manager) for their

contribution to the infrastructure of the FORGE Canada Consor-

tium. The authors also wish to acknowledge the contribution of

the high-throughput sequencing platform of the McGill

University and Genome Quebec Innovation Centre (Montreal,

Canada). This work was funded by the Government of Canada

through Genome Canada, the Canadian Institutes of Health

Research (CIHR), and the Ontario Genomics Institute (OGI-049).

Additional funding was provided by Genome Quebec and

Genome British Columbia. K. Boycott is supported by a Clinical

Investigatorship Award from the CIHR Institute of Genetics.

J.L. Michaud is a National Scholar from the Fonds de la Recherche

en Sante du Quebec (FRSQ). M. Srour holds a training award from

the FRSQ.

Received: December 15, 2011

Revised: January 23, 2012

Accepted: February 13, 2012

Published online: March 15, 2012

Web Resources

The URLs for data presented herein are as follows:

1,000 Genomes Browser, http://browser.1000genomes.org/index.html

Allen Brain Atlas, http://www.brain-map.org/

BioGPS, http://biogps.org

dbSNP, http://www.ncbi.nlm.nih.gov/projects/SNP/

Ensembl Genome Browser, http://www.ensembl.org

ESP Exome Variant Server, http://evs.gs.washington.edu/EVS/

Gene Ontology, http://www.geneontology.org/

Illumina’s Body Map 2.0 transcriptome, http://www.ebi.ac.uk/arrayexpress/browse.html?keywords=E-MTAB-513

NCBI HomoloGene, http://www.ncbi.nlm.nih.gov/homologene

NCBI Nucleotide Database, http://www.ncbi.nlm.nih.gov/nuccore

Online Mendelian Inheritance in Man (OMIM), http://www.omim.org

Polyphen-2, http://genetics.bwh.harvard.edu/pph2/

SIFT, http://sift.jcvi.org/

Unigene, http://www.ncbi.nlm.nih.gov/unigene

References

1. Sattar, S., and Gleeson, J.G. (2011). The ciliopathies in

neuronal development: a clinical approach to investigation

of Joubert syndrome and Joubert syndrome-related disorders.

Dev. Med. Child Neurol. 53, 793–798.

2. Bielas, S.L., Silhavy, J.L., Brancati, F., Kisseleva, M.V., Al-Gazali,

L., Sztriha, L., Bayoumi, R.A., Zaki, M.S., Abdel-Aleem, A.,

Rosti, R.O., et al. (2009). Mutations in INPP5E, encoding

inositol polyphosphate-5-phosphatase E, link phosphatidyl

inositol signaling to the ciliopathies. Nat. Genet. 41, 1032–

1036.

3. Edvardson, S., Shaag, A., Zenvirt, S., Erlich, Y., Hannon, G.J.,

Shanske, A.L., Gomori, J.M., Ekstein, J., and Elpeleg, O.

(2010). Joubert syndrome 2 (JBTS2) in Ashkenazi Jews is

associated with a TMEM216 mutation. Am. J. Hum. Genet.

86, 93–97.

4. Valente, E.M., Logan, C.V., Mougou-Zerelli, S., Lee, J.H.,

Silhavy, J.L., Brancati, F., Iannicelli, M., Travaglini, L., Romani,

S., Illi, B., et al. (2010). Mutations in TMEM216 perturb cilio-

genesis and cause Joubert, Meckel and related syndromes.

Nat. Genet. 42, 619–625.

5. Dixon-Salazar, T., Silhavy, J.L., Marsh, S.E., Louie, C.M., Scott,

L.C., Gururaj, A., Al-Gazali, L., Al-Tawari, A.A., Kayserili, H.,

Sztriha, L., and Gleeson, J.G. (2004). Mutations in the AHI1

gene, encoding jouberin, cause Joubert syndrome with

cortical polymicrogyria. Am. J. Hum. Genet. 75, 979–987.

6. Parisi, M.A., Bennett, C.L., Eckert, M.L., Dobyns, W.B., Glee-

son, J.G., Shaw, D.W., McDonald, R., Eddy, A., Chance, P.F.,

and Glass, I.A. (2004). The NPHP1 gene deletion associated

with juvenile nephronophthisis is present in a subset of indi-

viduals with Joubert syndrome. Am. J. Hum. Genet. 75, 82–91.

7. Valente, E.M., Silhavy, J.L., Brancati, F., Barrano, G., Krishnas-

wami, S.R., Castori, M., Lancaster, M.A., Boltshauser, E., Boc-

cone, L., Al-Gazali, L., et al; International Joubert Syndrome

Related Disorders Study Group. (2006). Mutations in

CEP290, which encodes a centrosomal protein, cause pleio-

tropic forms of Joubert syndrome. Nat. Genet. 38, 623–625.

8. Sayer, J.A., Otto, E.A., O’Toole, J.F., Nurnberg, G., Kennedy,

M.A., Becker, C., Hennies, H.C., Helou, J., Attanasio, M.,

Fausett, B.V., et al. (2006). The centrosomal protein nephro-

cystin-6 is mutated in Joubert syndrome and activates

transcription factor ATF4. Nat. Genet. 38, 674–681.

9. Baala, L., Romano, S., Khaddour, R., Saunier, S., Smith, U.M., Audollent,

S., Ozilou, C., Faivre, L., Laurent, N., Foliguet, B., et al.

(2007). The Meckel-Gruber syndrome gene, MKS3, is mutated

in Joubert syndrome. Am. J. Hum. Genet. 80, 186–194.

10. Arts, H.H., Doherty, D., van Beersum, S.E., Parisi, M.A.,

Letteboer, S.J., Gorden, N.T., Peters, T.A., Marker, T., Voesenek,

K., Kartono, A., et al. (2007). Mutations in the gene encoding

the basal body protein RPGRIP1L, a nephrocystin-4 interactor,

cause Joubert syndrome. Nat. Genet. 39, 882–888.

11. Delous, M., Baala, L., Salomon, R., Laclef, C., Vierkotten, J.,

Tory, K., Golzio, C., Lacoste, T., Besse, L., Ozilou, C., et al.

(2007). The ciliary gene RPGRIP1L is mutated in cerebello-

oculo-renal syndrome (Joubert syndrome type B) and Meckel

syndrome. Nat. Genet. 39, 875–881.

12. Cantagrel, V., Silhavy, J.L., Bielas, S.L., Swistun, D., Marsh,

S.E., Bertrand, J.Y., Audollent, S., Attie-Bitach, T., Holden,

K.R., Dobyns, W.B., et al; International Joubert Syndrome

Related Disorders Study Group. (2008). Mutations in the cilia

gene ARL13B lead to the classical form of Joubert syndrome.

Am. J. Hum. Genet. 83, 170–179.

13. Noor, A., Windpassinger, C., Patel, M., Stachowiak, B., Mikhailov,

A., Azam, M., Irfan, M., Siddiqui, Z.K., Naeem, F., Pater-

son, A.D., et al. (2008). CC2D2A, encoding a coiled-coil and

C2 domain protein, causes autosomal-recessive mental retar-

dation with retinitis pigmentosa. Am. J. Hum. Genet. 82,

1011–1018.

14. Gorden, N.T., Arts, H.H., Parisi, M.A., Coene, K.L., Letteboer,

S.J., van Beersum, S.E., Mans, D.A., Hikida, A., Eckert, M.,

Knutzen, D., et al. (2008). CC2D2A is mutated in Joubert

syndrome and interacts with the ciliopathy-associated basal

body protein CEP290. Am. J. Hum. Genet. 83, 559–571.

15. Dafinger, C., Liebau, M.C., Elsayed, S.M., Hellenbroich, Y.,

Boltshauser, E., Korenke, G.C., Fabretti, F., Janecke, A.R., Eber-

mann, I., Nurnberg, G., et al. (2011). Mutations in KIF7 link

Joubert syndrome with Sonic Hedgehog signaling and micro-

tubule dynamics. J. Clin. Invest. 121, 2662–2667.


16. Garcia-Gonzalo, F.R., Corbit, K.C., Sirerol-Piquer, M.S., Ramas-

wami, G., Otto, E.A., Noriega, T.R., Seol, A.D., Robinson, J.F.,

Bennett, C.L., Josifova, D.J., et al. (2011). A transition zone

complex regulates mammalian ciliogenesis and ciliary

membrane composition. Nat. Genet. 43, 776–784.

17. Sang, L., Miller, J.J., Corbit, K.C., Giles, R.H., Brauer, M.J.,

Otto, E.A., Baye, L.M., Wen, X., Scales, S.J., Kwong, M., et al.

(2011). Mapping the NPHP-JBTS-MKS protein network

reveals ciliopathy disease genes and pathways. Cell 145,

513–528.

18. Huang, L.J., Szymanska, K., Jensen, V.L., Janecke, A.R., Innes,

A.M., Davis, E.E., Frosk, P., Li, C.M., Willer, J.R., Chodirker,

B.N., et al. (2011). TMEM237 is mutated in individuals with

a Joubert syndrome related disorder and expands the role of

the TMEM family at the ciliary transition zone. Am. J. Hum.

Genet. 89, 713–730.

19. Lee, J.E., Silhavy, J.L., Zaki, M.S., Schroth, J., Bielas, S.L.,

Marsh, S.E., Olvera, J., Brancati, F., Iannicelli, M., Ikegami,

K., et al. (2012). CEP41 is mutated in Joubert syndrome and

is required for tubulin glutamylation at the cilium. Nat.

Genet. 44, 193–199.

20. Joubert, M., Eisenring, J.J., Robb, J.P., and Andermann, F.

(1969). Familial agenesis of the cerebellar vermis. A syndrome

of episodic hyperpnea, abnormal eye movements, ataxia, and

retardation. Neurology 19, 813–825.

21. Andermann, F., Andermann, E., Ptito, A., Fontaine, S., and

Joubert, M. (1999). History of Joubert syndrome and a 30-

year follow-up of the original proband. J. Child Neurol. 14,

565–569.

22. Fortin, J.C., and Lechasseur, A. (1999). Le Bas-Saint-Laurent

(Quebec: Presses de l’Universite Laval).

23. Hebert, P.M. (1994). Les Acadiens du Quebec (Montreal:

Editions de l'Echo).

24. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira,

M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly,

M.J., and Sham, P.C. (2007). PLINK: A tool set for whole-

genome association and population-based linkage analyses.

Am. J. Hum. Genet. 81, 559–575.

25. Majewski, J., Schwartzentruber, J.A., Caqueret, A., Patry, L.,

Marcadier, J., Fryns, J.P., Boycott, K.M., Ste-Marie, L.G.,

McKiernan, F.E., Marik, I., et al; FORGE Canada Consortium.

(2011). Mutations in NOTCH2 in families with Hajdu-Cheney

syndrome. Hum. Mutat. 32, 1114–1117.

26. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis,

K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly,

M., and DePristo, M.A. (2010). The Genome Analysis Toolkit:

A MapReduce framework for analyzing next-generation DNA

sequencing data. Genome Res. 20, 1297–1303.

27. Wang, K., Li, M., and Hakonarson, H. (2010). ANNOVAR:

Functional annotation of genetic variants from high-

throughput sequencing data. Nucleic Acids Res. 38, e164.

28. Davis, E.E., Zhang, Q., Liu, Q., Diplas, B.H., Davey, L.M., Hart-

ley, J., Stoetzel, C., Szymanska, K., Ramaswami, G., Logan,

C.V., et al; NISC Comparative Sequencing Program. (2011).

TTC21B contributes both causal and modifying alleles across

the ciliopathy spectrum. Nat. Genet. 43, 189–196.

29. Kumar, P., Henikoff, S., and Ng, P.C. (2009). Predicting the

effects of coding non-synonymous variants on protein func-

tion using the SIFT algorithm. Nat. Protoc. 4, 1073–1081.

30. Adzhubei, I.A., Schmidt, S., Peshkin, L., Ramensky, V.E., Gera-

simova, A., Bork, P., Kondrashov, A.S., and Sunyaev, S.R.

(2010). A method and server for predicting damaging

missense mutations. Nat. Methods 7, 248–249.

31. Mougou-Zerelli, S., Thomas, S., Szenker, E., Audollent, S.,

Elkhartoufi, N., Babarit, C., Romano, S., Salomon, R., Amiel,

J., Esculpavit, C., et al. (2009). CC2D2A mutations in Meckel

and Joubert syndromes indicate a genotype-phenotype corre-

lation. Hum. Mutat. 30, 1574–1582.

32. Bandyopadhyay, S., Chiang, C.Y., Srivastava, J., Gersten, M.,

White, S., Bell, R., Kurschner, C., Martin, C.H., Smoot, M.,

Sahasrabudhe, S., et al. (2010). A human MAP kinase interac-

tome. Nat. Methods 7, 801–805.

33. Ganesan, A.K., Kho, Y., Kim, S.C., Chen, Y., Zhao, Y., and

White, M.A. (2007). Broad spectrum identification of SUMO

substrates in melanoma cells. Proteomics 7, 2216–2221.

34. Huang, W., Zhou, Z., Asrar, S., Henkelman, M., Xie, W., and

Jia, Z. (2011). p21-Activated kinases 1 and 3 control brain

size through coordinating neuronal complexity and synaptic

properties. Mol. Cell. Biol. 31, 388–403.

35. Wilkinson, K.A., Nakamura, Y., and Henley, J.M. (2010).

Targets and consequences of protein SUMOylation in

neurons. Brain Res. Brain Res. Rev. 64, 195–212.

36. Yotova, V., Labuda, D., Zietkiewicz, E., Gehl, D., Lovell, A.,

Lefebvre, J.F., Bourgeois, S., Lemieux-Blanchard, E., Labuda,

M., Vezina, H., et al. (2005). Anatomy of a founder effect:

Myotonic dystrophy in Northeastern Quebec. Hum. Genet.

117, 177–187.

37. Roddier, K., Thomas, T., Marleau, G., Gagnon, A.M., Dicaire,

M.J., St-Denis, A., Gosselin, I., Sarrazin, A.M., Larbrisseau, A.,

Lambert, M., et al. (2005). Two mutations in the HSN2 gene

explain the high prevalence of HSAN2 in French Canadians.

Neurology 64, 1762–1767.


REPORT

Mutations in ROGDI Cause Kohlschutter-Tonz Syndrome

Anna Schossig,1,3,14 Nicole I. Wolf,2,4,14 Christine Fischer,3 Maria Fischer,5 Gernot Stocker,5

Stephan Pabinger,5 Andreas Dander,5 Bernhard Steiner,6 Otmar Tonz,6 Dieter Kotzot,1 Edda Haberlandt,7

Albert Amberger,1 Barbara Burwinkel,8,9 Katharina Wimmer,1 Christine Fauth,1

Caspar Grond-Ginsbach,10 Martin J. Koch,11 Annette Deichmann,12 Christof von Kalle,12

Claus R. Bartram,3 Alfried Kohlschutter,13 Zlatko Trajanoski,5 and Johannes Zschocke1,3,*

Kohlschutter-Tonz syndrome (KTS) is an autosomal-recessive disease characterized by the combination of epilepsy, psychomotor regres-

sion, and amelogenesis imperfecta. The molecular basis has not yet been elucidated. Here, we report that KTS is caused by mutations in

ROGDI. Using a combination of autozygosity mapping and exome sequencing, we identified a homozygous frameshift deletion,

c.229_230del (p.Leu77Alafs*64), in ROGDI in two affected individuals from a consanguineous family. Molecular studies in two addi-

tional KTS-affected individuals from two unrelated Austrian and Swiss families revealed homozygosity for nonsense mutation

c.286C>T (p.Gln96*) and compound heterozygosity for the splice-site mutations c.531+5G>C and c.532-2A>T in ROGDI, respectively.

The latter mutation was also found to be heterozygous in the mother of the Swiss affected individual in whom KTS was reported for

the first time in 1974. ROGDI is highly expressed throughout the brain and other organs, but its function is largely unknown. Possible

interactions with DISC1, a protein involved in diverse cytoskeletal functions, have been suggested. Our finding that ROGDI mutations

cause KTS indicates that the protein product of this gene plays an important role in neuronal development as well as amelogenesis.

Kohlschutter-Tonz syndrome (KTS, MIM 226750) is a rare

genetic disorder characterized by the combination of

epilepsy, psychomotor delay and regression, and amelogen-

esis imperfecta. So far, 24 individuals with the clinical diag-

nosis of KTS have been reported.1–9 Pedigrees suggest an

autosomal-recessive mode of inheritance, but genetic

heterogeneity cannot be excluded. The molecular basis of

KTS has not yet been elucidated. The most striking feature

is global enamel deficiency (amelogenesis imperfecta) of

the hypoplastic or hypocalcified type; this deficiency affects

primary as well as permanent teeth right from the moment

of eruption. The enamel is very thin, rough, prone to disintegration,

and stained in various shades of brown. Onset of

epilepsy usually occurs in the first year of life; seizures are

difficult to treat or might be refractory to therapy. Affected

children show severe psychomotor delay or regression,

which might be present after birth but more frequently

develops after the onset of seizures. Both gross and fine

motor skills are usually impaired, and intellectual disability

might be severe. The natural course is variable; several

affected individuals developed spastic tetraplegia, and

some died in childhood. There are no consistent dysmorphic

features or metabolic abnormalities, although nonspecific

facial anomalies have been reported in some affected individuals.

Cranial imaging frequently shows mild brain atrophy.

In order to identify the genetic basis of KTS, we investi-

gated four affected children from three families as well as

healthy members of the index family reported in 1974.1

Clinical features of the affected individuals are summa-

rized in Table 1. Family A is a consanguineous Moroccan

family with two affected children (A-IV:3 and A-IV:4;

Figure 1);9 the parents are first cousins. Initial development

of the affected boy (A-IV:3) appeared normal, but treat-

ment-resistant epilepsy started when he was 4 months

old and led to loss of fixation and global developmental

delay. The affected younger sister (A-IV:4) showed psycho-

motor delay from birth onward. Epileptic seizures, which

were difficult to treat, started when she was 12 months

old. The first teeth in both children erupted when they

were 13 and 14 months old, respectively; from the begin-

ning, their teeth were lusterless and had a brownish dis-

coloration. Family B has been reported previously;8 the

parents of the affected boy (B-II:1) are not knowingly

related but come from neighboring villages in East Tyrol

(Austria). Epilepsy started when the boy was 5 months

old but later improved; there were no seizures after 7 years

of age, and medication was discontinued when he was

15 years old. Primary and permanent teeth were yellow,

hypoplastic, and crowded. Family C has one affected girl

(C-XI:2) who has not yet been reported. Left-sided hemi-

convulsive seizures started when she was 6 months old

and were initially difficult to treat, but when she was

6 years old, anticonvulsive treatment could be discontin-

ued. Primary and secondary dentition showed enamel

1Division of Human Genetics, Medical University Innsbruck, 6020 Innsbruck, Austria; 2Department of Child Neurology, VU University Medical Center,

1007 MB Amsterdam, The Netherlands; 3Institute of Human Genetics, Heidelberg University, 69120 Heidelberg, Germany; 4Department of Child

Neurology, Heidelberg University, 69120 Heidelberg, Germany; 5Division of Bioinformatics, Medical University Innsbruck, 6020 Innsbruck, Austria; 6Chil-

dren’s Hospital, 6000 Lucerne, Switzerland; 7Department of Pediatrics, Medical University Innsbruck, 6020 Innsbruck, Austria; 8German Cancer Research

Center, 69120 Heidelberg, Germany; 9Department of Obstetrics and Gynecology, Heidelberg University, 69120 Heidelberg, Germany; 10Department of

Neurology, Heidelberg University, 69120 Heidelberg, Germany; 11Department of Oral, Dental, and Maxillofacial Diseases, Heidelberg University, 69120

Heidelberg, Germany; 12National Center for Tumor Diseases and German Cancer Research Center, 69120 Heidelberg, Germany; 13University Hospital

for Child and Adolescent Medicine, 20246 Hamburg, Germany

14These authors contributed equally to this work

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.012. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 701–707, April 6, 2012 701

abnormalities typical of KTS (Figure 2). Genealogical

studies revealed that this girl is distantly related to the

mother of the affected individuals (C-IX:3) reported in

1974 (reference 1) via both the maternal line (six generations ago)

and the paternal line (nine generations ago). The parents

of individual C-XI:2 are also eighth cousins (see family C

in Figure 1).

In order to identify the candidate gene for KTS, we per-

formed linkage analysis and autozygosity mapping in

family A. Analyses of all families were carried out with

informed consent and were approved by the institutional

review board at Medical University Innsbruck. Affected

individuals, siblings, parents, and grandparents were in-

vestigated. We analyzed 250 ng of genomic DNA from

each individual on the SNP-based mapping-chip Gene-

Chip HumanMapping 10K Array (Affymetrix, Santa Clara,

CA, USA); we used the operating software (Affymetrix

GCOS 1.4) and genotyping-analysis software (Affymetrix

GTYP 4.0) according to the manufacturer’s instructions.

A multipoint LOD score was calculated with the software

programs Allegro10 and ALOHOMORA.11 The haplotype

analysis and the LOD-score estimation based on the model

of autosomal-recessive inheritance showed four possible

linkage regions in chromosomal regions 3q13.31–q13.32,

11q24.1–q24.2, 16p13.3, and 17q25.1–q25.3 (Figure 3A).

LOD scores in these regions ranged between 1.05 and

2.06. The autozygous regions had a total size of 15.83 Mb

and contained 326 known protein-coding genes (see

Table S1).
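
For readers unfamiliar with LOD scores, the following is a minimal sketch of the simplest case, a two-point LOD score for fully informative, phase-known meioses. It is not the multipoint calculation performed by Allegro in the study; the function name and the example counts are assumptions chosen purely for illustration.

```python
import math


def two_point_lod(recombinants: int, meioses: int, theta: float) -> float:
    """Return log10[L(theta) / L(0.5)] for fully informative, phase-known meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie in [0, meioses]")
    non_recombinants = meioses - recombinants
    if theta == 0.0:
        # Under complete linkage a single recombinant has probability zero.
        if recombinants > 0:
            return float("-inf")
        log_linked = 0.0
    else:
        log_linked = (recombinants * math.log10(theta)
                      + non_recombinants * math.log10(1.0 - theta))
    # Under the null hypothesis of no linkage, theta is 0.5 for every meiosis.
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


if __name__ == "__main__":
    # Hypothetical counts of non-recombinant meioses, evaluated at theta = 0.
    for n in (3, 5, 7):
        print(n, round(two_point_lod(0, n, 0.0), 2))
```

In this simplified model each non-recombinant meiosis contributes log10(2), roughly 0.30, at theta = 0, which illustrates why a single small family often yields LOD scores below the conventional significance threshold of 3.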

Considering the large number of genes in the autozy-

gous regions, we decided to use whole-exome sequencing

(carried out by ServiceXS, Leiden, The Netherlands) for

the genetic analysis of one affected individual (A-IV:4)

from family A. Exome capturing was performed with the

Agilent SureSelect Human All Exon Kit (Agilent, Santa

Clara, CA), and the sample was sequenced on an Illumina

Genome Analyzer II platform (Illumina, San Diego, CA).

Data analysis was carried out with the SIMPLEX pipeline,

which uses the Burrows Wheeler Aligner12 to map the

reads to the human reference-genome sequence (UCSC

Table 1. Clinical Features of KTS and ROGDI Genotypes in the Affected Individuals

Feature | A-IV:3 | A-IV:4 | B-II:1 | C-XI:2
ROGDI genotype | homozygous for c.229_230del (p.Leu77Alafs*64) | homozygous for c.229_230del (p.Leu77Alafs*64) | homozygous for c.286C>T (p.Gln96*) | compound heterozygous for c.531+5G>C and c.532-2A>T
Age at time of last evaluation | 12 years | 9 years | 18 years | 9 years
Growth parameters | mild microcephaly | normal | normal | normal
Initial development | normal until onset of seizures | developmental delay since birth | normal until onset of seizures | normal until onset of seizures
Language skills and social interaction at time of last evaluation | no expressive language | some words; deterioration of social interaction after onset of seizures | 35 single words and sentences with two words; social and friendly behavior | competent to talk in short and simple sentences
Age of walking without support | 4.5 years | 2.2 years | 2.5 years | 2 years
Age of seizure onset | 4 months | 12 months | 5 months | 6 months
EEG findings (generalized or partial traits) | multifocal epileptic activity and poorly developed background activity | focal epileptic activity and poorly developed background activity | multifocal epileptic activity (later generalized) and abnormal background activity | focal epileptic activity; normalization at 6 years of age
Seizure type and frequency | episodes of cyanosis and apnea; later generalized tonic-clonic seizures (1–5 per day); only seizures with fever since start of levetiracetam at 3.5 years of age | mostly myoclonic seizures (1–5 per day); only seizures with fever since start of levetiracetam at 1.8 years of age | focal and generalized seizures (1–5 per day); seizure free since 7 years of age and no medication since 15 years of age | left-sided hemiconvulsive seizures and various anticonvulsants; seizure free without treatment since 6 years of age
Hearing | normal | normal | normal | normal
Vision | loss of visual fixation after onset of seizures | normal | normal | normal
Dentition | eruption of first teeth at 13 months of age; discoloration from the beginning | eruption of first teeth at 14 months of age; lusterless and rapid discoloration | primary and permanent teeth with discoloration and enamel defects | primary and permanent teeth with discoloration and enamel defects

The following abbreviation is used: EEG, electroencephalography.


hg19, February 2009, Genome Reference Consortium

GRCh37). For SNP and DIP (deletion-insertion polymor-

phism) calling, as well as for realignment around indels,

we applied the Genome Analysis Toolkit (GATK).13 Exon

boundaries were specified by the Consensus Coding

Sequence (CCDS).10 An exome coverage depth of 23×

was achieved: 46% of exons showed high coverage

(≥20×), and around 10% of exons showed low coverage

(≤5×). Variant detection identified 20,454 SNPs as well

as 1,208 DIPs. We annotated all variants with additional

information by using GATK and ANNOVAR14 to facilitate

the identification of disease-causing mutations. Subse-

quently, we applied the auto_annovar functionality to

filter variants against dbSNP (build 132), the 1000

Genomes Project (Nov 2010), and previously assigned

conservation scores (for filtering details, see Table S2). After

all filtering steps, only a single strong candidate gene,

ROGDI (rogdi homolog [Drosophila], RefSeq accession

number NM_024589.1) in chromosomal region 16p13.3,

remained in the autozygous regions of interest. In exon 4

of this gene, we found a homozygous frameshift deletion,

c.229_230del (p.Leu77Alafs*64), which is predicted to

disrupt the amino acid structure and cause a premature

stop codon (Figure 3B). The filtering algorithm also called

a missense variant, c.2273G>C (p.Cys758Ser), in EVPL

(envoplakin [MIM 601590]) in the linkage region of

17q25.1; this variant (rs142251448) was included in build

134/135 of dbSNP and had a heterozygote frequency of

0.3% in the North American population.
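
The filtering logic described above can be illustrated with a small sketch. This is not the SIMPLEX or ANNOVAR implementation; it only shows the general idea of keeping homozygous variants that fall inside autozygous intervals and are absent from population databases. All class names, field names, gene labels, and coordinates below are hypothetical, because the article does not give interval coordinates in the text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # 1-based, inclusive (hypothetical coordinates)
    end: int    # 1-based, inclusive


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    gene: str
    homozygous: bool
    in_dbsnp: bool
    in_1000_genomes: bool


def in_any_region(variant: Variant, regions: Sequence[Region]) -> bool:
    """True if the variant lies inside at least one autozygous interval."""
    return any(
        r.chrom == variant.chrom and r.start <= variant.pos <= r.end
        for r in regions
    )


def candidate_variants(variants: Sequence[Variant],
                       regions: Sequence[Region]) -> List[Variant]:
    """Keep homozygous, previously unreported variants inside the intervals."""
    return [
        v for v in variants
        if v.homozygous
        and not v.in_dbsnp
        and not v.in_1000_genomes
        and in_any_region(v, regions)
    ]


if __name__ == "__main__":
    # Toy data only: intervals and variants are invented for illustration.
    regions = [Region("16", 1_000_000, 6_000_000)]
    variants = [
        Variant("16", 4_850_000, "GENE_A", True, False, False),   # kept
        Variant("16", 4_900_000, "GENE_B", True, True, False),    # known SNP
        Variant("2", 1_500_000, "GENE_C", True, False, False),    # outside interval
        Variant("16", 5_100_000, "GENE_D", False, False, False),  # heterozygous
    ]
    for v in candidate_variants(variants, regions):
        print(v.gene, v.chrom, v.pos)
```

A filter of this kind depends on the database release used: a variant that passes against one build can be excluded once a later build lists it, as happened with the EVPL variant mentioned above.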

After completing exome sequencing, we found a PhD

thesis that reports the results of autozygosity mapping in

five families affected by KTS.15 That study identified 30

candidate genes, including ROGDI, but did not find any

linkage to chromosomal region 17q25.1. There is no

other report on a possible link between KTS and ROGDI

or EVPL. Also considering the expected severity of the

frameshift deletion found in family A, we focused our

subsequent studies on ROGDI (Ensembl accession number

ENSG00000067836). This gene stretches over 5.98 kb in

chromosomal region 16p13.3 and contains 11 exons, all

of which are coding. Bioinformatics analysis showed that

the transcript of ROGDI codes for 287 amino acids and

results in a molecular weight of 32 kDa (RefSeq accession

number NP_078865.1). There is only one known func-

tional transcript. Dye-terminator sequencing of all exons

and adjacent intron sequences of ROGDI (NM_024589.1)

(ABI Prism 7000 sequence detection system, Applied Bio-

systems, Carlsbad, CA; primer sequences are available in

Table S3) confirmed the homozygous presence of the

mutation c.229_230del in both affected siblings of family

A (Figure 3C). As expected, both parents were found to

be heterozygous. Sequence analysis in family B revealed

Figure 1. Pedigrees of Investigated Families

The pedigree for family A shows a consanguineous Moroccan family9 in which linkage analysis and exome sequencing were performed. The pedigree for family B shows a Tyrolean family affected by KTS,8 and the parents are not knowingly related. The pedigree for family C shows the newly diagnosed Swiss family (the parents are X:1 and X:2) and its relationship with the distantly related index family reported in 1974 (parents IX:3 and IX:4).1 Note that the parents in the newly identified family are distantly related to each other, but the affected child is compound heterozygous for two different mutations.


a homozygous nonsense mutation, c.286C>T, in exon 5 of

ROGDI in affected individual B-II:1 (Figure 3D). This muta-

tion is predicted to change a CAG triplet that codes for

glutamine into a TAG stop codon, denoted p.Gln96*.

Both parents in the family were heterozygous for this

mutation. In family C, two heterozygous splice-site mutations,

c.531+5G>C and c.532-2A>T, in intron 7 of ROGDI

were identified in affected individual C-XI:2 (Figures 3E

and 3F). In silico analysis indicated that both mutations

destroy the respective splice donor and acceptor sites of

intron 7 (Alamut [Interactive Biosoftware, Rouen, France],

data not shown). The mother (C-X:2) was found to be

heterozygous for c.532-2A>T, and the father (C-X:1) was

found to be heterozygous for c.531+5G>C, confirming

compound heterozygosity in the affected child. The unaf-

fected sister (C-XI:1) was found to be heterozygous for

c.531+5G>C. Finally, we acquired archival DNA from

the unaffected mother (C-IX:3) and four healthy siblings

(C-X:3, C-X:8, C-X:10, and C-X:13) of the original family

reported by Kohlschutter et al.1 (family C in Figure 1);

none of these individuals have epilepsy and all have

normal intelligence and normal teeth with intact enamel.

The affected family members as well as the father of that

family are deceased, and their DNA samples are not avail-

able. The mother and all investigated siblings are heterozy-

gous for splice-site mutation c.532-2A>T, which is also

found in the mother of affected individual C-XI:2. It can

be assumed that the mothers from both family branches

have a common ancestor who lived in the Swiss valley of

Schachental in the 18th century and who was a carrier

for this mutation (family C in Figure 1).
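
The relationship between the coding-DNA positions and the protein-level names given above (c.229 and p.Leu77, c.286 and p.Gln96) can be checked with a short sketch. It assumes that coding positions are numbered from 1 at the A of the start codon; the function names are illustrative and do not come from any particular library.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_number(cds_position: int) -> int:
    """1-based codon that contains a 1-based coding-DNA position."""
    return (cds_position - 1) // 3 + 1


def offset_in_codon(cds_position: int) -> int:
    """0-based offset of the position within its codon (0, 1, or 2)."""
    return (cds_position - 1) % 3


def substitute(codon: str, cds_position: int, alt_base: str) -> str:
    """Apply a single-base substitution to the codon that holds the position."""
    i = offset_in_codon(cds_position)
    return codon[:i] + alt_base + codon[i + 1:]


if __name__ == "__main__":
    # c.286C>T: the text states that a CAG (Gln) triplet becomes TAG (stop).
    mutant = substitute("CAG", 286, "T")
    print(codon_number(286), mutant, mutant in STOP_CODONS)
    # c.229_230del begins in the codon named in p.Leu77Alafs*64.
    print(codon_number(229))
```

Position 286 is the first base of codon 96 and position 229 is the first base of codon 77, which agrees with the protein-level descriptions p.Gln96* and p.Leu77Alafs*64.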

Figure 2. Dental Phenotype in the So Far Unreported Individual C-XI:2

Tooth discoloration due to global enamel defect (amelogenesis imperfecta).

Figure 3. Linkage and Genomic Sequence Analyses

(A) Linkage analysis in family A revealed four autozygous regions in chromosomes 3, 11, 16, and 17.

(B) Exome sequencing in family A revealed a homozygous 2 bp deletion, c.229_230del, in exon 4 of ROGDI.

(C–F) Identification of mutations by Sanger sequencing. Homozygous deletion c.229_230del (C) is present in family A, homozygous nonsense mutation c.286C>T (D) is present in family B, and heterozygous splice-site mutations c.531+5G>C (E) and c.532-2A>T (F) are present in family C.


All mutations identified were frameshift, nonsense, or

splice-site mutations that are expected to either cause

premature mRNA degradation by nonsense-mediated

decay or dramatically alter protein structure and conse-

quently cause complete loss of protein function. They are

not listed in publicly available genome-variant databases

and are absent from the 1000 Genomes Project. In order

to assess the functional effects of the different mutations,

we obtained fresh peripheral-blood samples from the

affected individuals in families B and C. Peripheral-blood

mononuclear cells (PBMC) were isolated from blood

samples and cultivated in the presence of phytohemagglu-

tinin (Quantum PBL by PAA Laboratories GmbH, Pasch-

ing, Austria) for three days. Thereafter, RNA was isolated,

and cDNA synthesis was performed by standard methods.

RT-PCR amplification spanning exons 6–9 of the tran-

scripts in affected individual C-XI:2 (primer sequences

are available as Table S4) showed that the wild-type

amplicon (386 bp) was absent but that a strong shorter

band (approximately 290 bp) and a weak band somewhat

larger than the wild-type band were present (Figure 4A).

The other family members showed the wild-type ampli-

con, but the father and sister (both heterozygous for

c.531+5G>C) also showed the shorter band, and the

mother (heterozygous for c.532-2A>T) also showed the

weak larger band. Dye-terminator sequencing of these

products revealed that the short band reflects an in-frame

deletion of exon 7 caused by mutation c.531+5G>C

(Figure 4C). The other amplicon associated with mutation

c.532-2A>T was detectable as background sequencing

trace in the mother and the affected child. The aberrant

transcript is a result of the use of an intron 7 cryptic splice

acceptor site that leads to the inclusion of an additional

83 nucleotides before exon 8 (data not shown). The pre-

dicted effect is the inclusion of two abnormal amino acids

followed by a stop codon. cDNA sequence analysis was not

performed in affected individual B-II:1, who is homozy-

gous for the nonsense mutation c.286C>T, which is not

expected to affect splicing.
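
Why one aberrant transcript stays in frame while the other does not follows from simple modular arithmetic on the change in transcript length. The sketch below illustrates this. The 83-nucleotide insertion is taken from the text; the exon 7 length of 96 nucleotides is an assumption inferred from the approximate amplicon sizes (386 bp versus about 290 bp) and is not stated in the article.

```python
def preserves_reading_frame(length_change_nt: int) -> bool:
    """A net insertion or deletion keeps the frame only if it is a multiple of 3."""
    return length_change_nt % 3 == 0


def describe(label: str, length_change_nt: int) -> str:
    """Summarize the predicted effect of a length change on the reading frame."""
    effect = "in-frame" if preserves_reading_frame(length_change_nt) else "frameshift"
    return f"{label}: {length_change_nt:+d} nt -> {effect}"


if __name__ == "__main__":
    # Inclusion of 83 intronic nucleotides before exon 8 (value from the text).
    print(describe("cryptic acceptor site (c.532-2A>T)", 83))
    # Exon 7 skipping; 96 nt is an assumed exon length, used only for illustration.
    print(describe("exon 7 skipping (c.531+5G>C)", -96))
```

Because 83 is not divisible by 3, the inclusion shifts the reading frame, which is compatible with the premature stop codon described above; a skipped exon whose length is a multiple of 3 leaves the downstream frame intact.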

We quantified the expression of ROGDI with real-time

PCR by using specific primers spanning exons 3–4 (primer

sequences are available in Table S4) and Maxima SYBR

Green/ROX qPCR Master Mix (Fermentas) in an Applied

Biosystems Prism 7000 sequence detection system. PCR

reaction was carried out under standard conditions.

The cycle threshold (Ct) values were calculated with


Figure 4. cDNA Analyses

(A) RT-PCR analysis of ROGDI in family C. Note the absence of the wild-type amplicon as well as the presence of two aberrant bands in affected individual C-XI:2. One of the aberrant bands is approximately 100 bp shorter than the wild-type band and is also found in the father (C-X:1) and sister (C-XI:1), who are both heterozygous for c.531+5G>C. The other aberrant band is weak, approximately 80 bp larger than the wild-type band, and is also observed in the mother (C-X:2), who is heterozygous for c.532-2A>T.

(B) RT-qPCR analysis of ROGDI in affected individuals, healthy family members, and controls shows markedly reduced mRNA transcript in affected individual B-II:1. Heterozygosity for c.532-2A>T in C-X:2 is associated with a cDNA reduction of approximately 50%, most likely reflecting nonsense-mediated decay of that allele. In contrast, heterozygosity for c.531+5G>C is not associated with the loss of cDNA in C-X:1 and C-XI:1. The fact that affected individual C-XI:2 has half normal cDNA reflects the combination of both alleles. The error bars represent means and standard deviations of three independent measurements of the probands and four controls.

(C) cDNA sequence analysis of the RT-PCR product of exons 6–9 in individual C-X:1, heterozygous for c.531+5G>C, shows skipping of in-frame ROGDI exon 7.


sequence-detection system (SDS) software v1.2 (Applied

Biosystems). We quantified relative gene expression with

the comparative ΔΔCt method by using HPRT1 (RefSeq

accession number NM_000194.2) as a reference gene.

These analyses showed that the amount of ROGDI cDNA

was markedly reduced to 10.6% (much lower than the

mean of the four controls) in affected individual B-II:1

(Figure 4B). The amount of cDNA in affected individual

C-XI:2 was 43.6%, similar to the value of 46.9% in her

mother (C-X:2). The amount of ROGDI cDNA in the father

and sister was in the normal range (86.6% and 100.2%,

respectively).
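
The comparative ΔΔCt calculation used above can be sketched as follows. The sketch assumes the textbook form of the method, with amplification efficiency close to 100% so that one cycle corresponds to a twofold difference; the function name and the Ct values in the example are hypothetical and are not data from the study.

```python
def relative_expression(ct_target_sample: float,
                        ct_reference_sample: float,
                        ct_target_control: float,
                        ct_reference_control: float) -> float:
    """Relative expression by the comparative ddCt method (2 ** -ddCt)."""
    # Normalize the target gene to the reference gene within each specimen.
    delta_ct_sample = ct_target_sample - ct_reference_sample
    delta_ct_control = ct_target_control - ct_reference_control
    # Compare the normalized sample against the normalized control.
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Hypothetical Ct values: the target crosses threshold one cycle later
    # in the sample than in the control, relative to the reference gene.
    ratio = relative_expression(25.0, 20.0, 24.0, 20.0)
    print(f"{ratio:.1%} of control")
```

Under this model a ΔΔCt of one cycle corresponds to half the control expression, so values near 50% are what would be expected when one of two alleles contributes little or no stable transcript.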

In summary, the cDNA analyses confirm that the muta-

tions in affected individuals B-II:1 and C-XI:2 severely

disrupt the normal ROGDI transcript. Mutation

c.531+5G>C causes skipping of in-frame exon 7 but

does not lead to a translational frameshift and is not asso-

ciated with nonsense-mediated decay. In contrast, muta-

tion c.532-2A>T triggers the use of a cryptic intronic splice

acceptor site, explaining both a larger size of the cDNA

amplicon and nonsense-mediated decay. The latter effect

was also observed for nonsense mutation c.286C>T.

Thus, the mutations in all three KTS-affected families are

expected to be severe (null) mutations that are likely to

cause complete loss of ROGDI function.

The exact function of the protein encoded by ROGDI

is unknown. Using ANNIE,16 sequence-structure analysis

showed neither relevant features (e.g., transmembrane

regions or signal peptides) nor relevant protein domains.

Protein prediction methods17 indicate that ROGDI is a

globular protein and that the secondary structure consists

of 45% helix motifs, 37% loop structures, and 17% strands.

The gene is highly conserved and has orthologs in many

species, including Drosophila melanogaster. It shows partic-

ularly high expression levels in various human brain

regions,18 in line with the CNS phenotype of KTS. A

Drosophila mutant of this gene showed a possible defi-

ciency in olfactory memory.19 Yeast two-hybrid screens20

suggested a possible interaction between ROGDI and

DISC1 (MIM 605210), a protein implicated in the develop-

ment of schizophrenia and involved in processes of cyto-

skeletal stability and organization, neuronal migration,

intracellular transport, and cell division.21 There are no

published studies that examined the role of ROGDI in

tooth development and amelogenesis. Our own data

provide robust information on the clinical effects of the

loss of ROGDI function in humans and provide interesting

perspectives for research into the molecular causes of

epilepsy and other conditions.

In conclusion, we report that KTS is caused by putative

loss-of-function mutations in ROGDI. All mutations

identified are predicted to be severe (null) mutations that

are likely to cause complete loss of protein function.

Heterozygosity for ROGDI-null mutations does not appear

to have any adverse effects. It is possible that individuals

with homozygosity or compound heterozygosity for

hypomorphic missense mutations in ROGDI could present

with isolated epilepsy independently from minor enamel

defects or vice versa. Assessing potential genotype-pheno-

type correlations will require molecular studies on addi-

tional affected individuals. Although we found ROGDI

mutations in all KTS-affected individuals investigated so

far, we cannot rule out genetic heterogeneity. Future

work will hopefully elucidate the exact function of ROGDI

in neuronal development and amelogenesis.

Supplemental Data

Supplemental Data include four tables and can be found with this

article online at http://www.cell.com/AJHG.

Acknowledgments

This work was supported by a grant from the Standortagentur

Tirol. We wish to thank Josef Muheim (Greppen, Switzerland)

for considerable help with the genealogical studies that allowed

us to link the two Swiss nuclear families into a single pedigree.

Long-term medical care to the affected individuals in the study

was provided by Thomas Schmitt-Mechelke and Petra Kolditz

(both from Children’s Hospital, Lucerne, Switzerland). Bart

Janssen and Thomas Chin-A-Woeng (both from ServiceXS,

Leiden, The Netherlands) assisted with the exome sequencing.

We gratefully acknowledge expert technical assistance by Brunhild

Schagen (Department of Oral, Dental, and Maxillofacial Diseases,

Heidelberg University, Germany) as well as by Pia Traunfellner,

Sandra Unterkirchner, and Ramona Berberich (all from the Divi-

sion of Human Genetics, Medical University Innsbruck, Austria).

Received: December 23, 2011

Revised: January 31, 2012

Accepted: February 15, 2012

Published online: March 15, 2012

Web Resources

The URLs for data presented herein are as follows:

dbSNP, http://www.ncbi.nlm.nih.gov/snp/

Ensembl, http://www.ensembl.org

GenBank, http://www.ncbi.nlm.nih.gov/genbank/

Online Mendelian Inheritance in Man (OMIM), http://www.

omim.org

References

1. Kohlschutter, A., Chappuis, D., Meier, C., Tonz, O., Vassella, F.,

and Herschkowitz, N. (1974). Familial epilepsy and yellow

teeth—a disease of the CNS associated with enamel hypo-

plasia. Helv. Paediatr. Acta 29, 283–294.

2. Christodoulou, J., Hall, R.K., Menahem, S., Hopkins, I.J., and

Rogers, J.G. (1988). A syndrome of epilepsy, dementia, and

amelogenesis imperfecta: Genetic and clinical features. J.

Med. Genet. 25, 827–830.

3. Petermoller, M., Kunze, J., and Gross-Selbeck, G. (1993).

Kohlschutter syndrome: Syndrome of epilepsy—dementia—

amelogenesis imperfecta. Neuropediatrics 24, 337–338.


4. Zlotogora, J., Fuks, A., Borochowitz, Z., and Tal, Y. (1993).

Kohlschutter-Tonz syndrome: Epilepsy, dementia, and amelo-

genesis imperfecta. Am. J. Med. Genet. 46, 453–454.

5. Musumeci, S.A., Elia, M., Ferri, R., Romano, C., Scuderi, C.,

and Del Gracco, S. (1995). A further family with epilepsy,

dementia and yellow teeth: The Kohlschutter syndrome. Brain

Dev. 17, 133–138, discussion 142–133.

6. Wygold, T., Kurlemann, G., and Schuierer, G. (1996). Kohl-

schutter syndrome—an example of a rare progressive neuroec-

todermal disease. Case report and review of the literature. Klin.

Padiatr. 208, 271–275.

7. Donnai, D., Tomlin, P.I., and Winter, R.M. (2005). Kohlschut-

ter syndrome in siblings. Clin. Dysmorphol. 14, 123–126.

8. Haberlandt, E., Svejda, C., Felber, S., Baumgartner, S., Gunther,

B., Utermann, G., and Kotzot, D. (2006). Yellow teeth,

seizures, and mental retardation: A less severe case of Kohl-

schutter-Tonz syndrome. Am. J. Med. Genet. A. 140, 281–283.

9. Schossig, A., Wolf, N., Grond-Ginsbach, C., Schagen, B.,

Koch, M., Rating, D., and Zschocke, J. (2007). Epileptische

Enzephalopathie und Zahnschmelzdefekt (Kohlschutter-

Tonz-Syndrom): Drei Fallberichte und Literaturubersicht.

Med. Genetik 19, 422–426.

10. Pruitt, K.D., Tatusova, T., and Maglott, D.R. (2007). NCBI reference

sequences (RefSeq): A curated non-redundant sequence

database of genomes, transcripts and proteins. Nucleic Acids

Res. 35 (Database issue), D61–D65.

11. Ruschendorf, F., and Nurnberg, P. (2005). ALOHOMORA:

A tool for linkage analysis using 10K SNP array data. Bioinfor-

matics 21, 2123–2125.

12. Li, H., and Durbin, R. (2009). Fast and accurate short read

alignment with Burrows-Wheeler transform. Bioinformatics

25, 1754–1760.

13. DePristo, M.A., Banks, E., Poplin, R., Garimella, K.V., Maguire,

J.R., Hartl, C., Philippakis, A.A., del Angel, G., Rivas, M.A.,

Hanna, M., et al. (2011). A framework for variation discovery

and genotyping using next-generation DNA sequencing data.

Nat. Genet. 43, 491–498.

14. Wang, K., Li, M., and Hakonarson, H. (2010). ANNOVAR:

Functional annotation of genetic variants from high-

throughput sequencing data. Nucleic Acids Res. 38, e164.

15. Lo, C. (2009). Genetics in Epilepsy. PhD thesis, University

College London, London, UK.

16. Ooi, H.S., Kwo, C.Y., Wildpaner, M., Sirota, F.L., Eisenhaber, B.,

Maurer-Stroh, S., Wong, W.C., Schleiffer, A., Eisenhaber, F.,

and Schneider, G. (2009). ANNIE: Integrated de novo protein

sequence annotation. Nucleic Acids Res. 37 (Web Server issue),

W435–W440.

17. Rost, B., Yachdav, G., and Liu, J. (2004). The PredictProtein

server. Nucleic Acids Res. 32 (Web Server issue), W321–W326.

18. Wu, C., Orozco, C., Boyer, J., Leglise, M., Goodale, J., Batalov,

S., Hodge, C.L., Haase, J., Janes, J., Huss, J.W., 3rd, and Su, A.I.

(2009). BioGPS: An extensible and customizable portal for

querying and organizing gene annotation resources. Genome

Biol. 10, R130.

19. Dubnau, J., Chiang, A.S., Grady, L., Barditch, J., Gossweiler, S.,

McNeil, J., Smith, P., Buldoc, F., Scott, R., Certa, U., et al.

(2003). The staufen/pumilio pathway is involved in

Drosophila long-term memory. Curr. Biol. 13, 286–296.

20. Camargo, L.M., Collura, V., Rain, J.C., Mizuguchi, K., Hermja-

kob, H., Kerrien, S., Bonnert, T.P., Whiting, P.J., and Brandon,

N.J. (2007). Disrupted in Schizophrenia 1 Interactome:

Evidence for the close connectivity of risk genes and a potential

synaptic basis for schizophrenia. Mol. Psychiatry 12, 74–86.

21. Brandon, N.J., and Sawa, A. (2011). Linking neurodevelop-

mental and synaptic theories of mental illness through

DISC1. Nat. Rev. Neurosci. 12, 707–722.

The American Journal of Human Genetics 90, 701–707, April 6, 2012 707

REPORT

A Nonsense Mutation in the Human Homolog of Drosophila rogdi Causes Kohlschutter–Tonz Syndrome

Adi Mory,1,2 Efrat Dagan,2,3 Barbara Illi,4 Philippe Duquesnoy,5 Shikma Mordechai,2 Ishai Shahor,6

Sveva Romani,4 Nivin Hawash-Moustafa,2 Hanna Mandel,1,6 Enza M. Valente,4 Serge Amselem,5

and Ruth Gershoni-Baruch1,2,*

Kohlschutter–Tonz syndrome (KTS) is a rare autosomal-recessive disorder of childhood onset, and it is characterized by global develop-

mental delay, spasticity, epilepsy, and amelogenesis imperfecta. In 12 KTS-affected individuals from a Druze village in northern Israel,

homozygosity mapping localized the gene linked to the disease to a 586,513 bp region (with a LOD score of 6.4) in chromosomal region

16p13.3. Sequencing of genes (from genomic DNA of an affected individual) in the linked region revealed chr16: 4,848,632 G>A, which

corresponds to ROGDI c.469C>T (p.Arg157*). The nonsense mutation was homozygous in all affected individuals, heterozygous in 10 of

100 unaffected individuals from the same Druze community, and absent from Druze controls from elsewhere. Wild-type ROGDI local-

izes to the nuclear envelope; ROGDI was not detectable in cells of affected individuals. All affected individuals suffered seizures, were

unable to speak, and had amelogenesis imperfecta. However, age of onset and the severity of mental and motor handicaps and that

of convulsions varied among affected individuals homozygous for the same nonsense allele.

Kohlschutter–Tonz syndrome (KTS) (MIM 226750) is

described as a rare autosomal-recessive neurodegenerative

disorder characterized by progressive dementia, spasticity,

and epilepsy.1,2 A clinical marker of KTS is a generalized

enamel defect, amelogenesis imperfecta, which is most

obvious as yellowed teeth. KTS was first identified in fami-

lies from Switzerland,1,2 Sicily,3 the Druze community of

northern Israel,4 and, subsequently, other locations in

western Europe.5–9 To date, only 21 affected individuals

have been reported. Seizures, intellectual impairment,

and amelogenesis imperfecta were reported in all families,

and other clinical features varied among reports. The goal

of this project was to identify the gene and the mutation

responsible for KTS in the highly consanguineous Druze

community.

We have compiled 14 new KTS cases pertaining to five

families, all of which originate from the same small Druze

village in northern Israel. The index case (II-1 in family 1),

who was referred to us at the age of 13 months for the eval-

uation of seizures, was noted to have amelogenesis imper-

fecta. Individual II-1 and her parents and siblings, as well

as informative relatives of four other families with affected

children, were enrolled (Figure S1, available online). A

consanguineous liaison is evident for families 3, 4, and 5,

although the parents of the affected children in family 4

are not known to be related. Family 2, unrelated to families

3–5, is consanguineous too (Figure S1). The study was

approved by the institutional review board at Rambam

Health Care Campus, Haifa, and after signed informed

consent (self and parental), a blood sample was drawn

for DNA extraction from all available family members

(both affected and healthy individuals). All affected indi-

viduals were clinically evaluated by a pediatric clinical

geneticist, medical records were reviewed, and parents

were interviewed.

The clinical characteristics of 14 KTS-affected individuals

(seven males and seven females; ages 2–24 years) are de-

picted in Table 1. Born at term after normal gestation,

they all appeared normal at birth and had no apparent

dysmorphology. Although all KTS cases ultimately displayed an

unequivocal phenotype heralded by seizures and ‘‘yellow

teeth,’’ they varied widely with regard to the severity of

the manifestations, even within the same nuclear family.

Family 4 has five affected children (V-4 to V-8) who all

had epileptic episodes that varied in age of onset, intensity,

frequency, and response to treatment. The firstborn child

(V-4), who died at the age of 2 years, was vegetative, failed

to thrive, and suffered from intractable convulsions and

microcephaly. Their second-born affected child (V-6) dis-

played impaired psychomotor development from the

first months of life and convulsive episodes, starting at

9 months of age, that were refractory to treatment. She

was nonverbal and nonambulant. It was noted that her

brother (V-5) lagged developmentally starting at 6 months

of age, and he is regularly maintained on anticonvulsants

(he has had a partial response). At 16.5 years of age, he is

awkwardly ambulant and nonverbal, performs mostly by

shouting and yelling, and is irritable and self-mutilating.

His 15-year-old sister (V-7) suffers from a convulsive

disorder that responds well to treatment, and, although

intellectually disabled, she manages to communicate

with her mother, utters a few words, and is ambulant.

The youngest sibling (V-8), currently 3.5 years old, has

a convulsive disorder that is only partially controlled,

1The Ruth and Bruce Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, 31096 Haifa, Israel; 2Institute of Human Genetics, Rambam

Health Care Campus, 31096 Haifa, Israel; 3Department of Nursing, Faculty of Social Welfare and Health Sciences, University of Haifa, 31905 Haifa, Israel; 4CSS-Mendel Institute, viale Regina Margherita 261, 00198 Rome, Italy; 5Institut National de la Sante et de la Recherche Medicale U.933 and Universite

Pierre et Marie Curie, Hopital Armand-Trousseau, 75012 Paris, France; 6Department of Pediatrics, Rambam Health Care Campus, 31096 Haifa, Israel

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.03.005. ©2012 by The American Society of Human Genetics. All rights reserved.

708 The American Journal of Human Genetics 90, 708–714, April 6, 2012

Table 1. Clinical Characteristics of 14 Individuals Affected by KTS

| Individual | II-1 | II-6 | II-7 | IV-1 | IV-2 | IV-4 | V-1 | V-3 | V-4 | V-5 | V-6 | V-7 | V-8 | V-12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Family | 1 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 5 |
| Gender | female | female | male | female | male | male | male | female | female | male | female | female | male | male |
| Birth weight (grams) | 3,500 | 3,500 | 3,500 | 3,250 | 3,250 | 3,500 | 2,500 | 2,400 | 2,300 | 3,140 | 2,800 | 3,500 | 3,800 | 3,500 |
| Delivery (a) | NVD | NVD | NVD | NVD | NVD | NVD | CS | CS | NVD | NVD | NVD | NVD | NVD | NVD |
| Current age (years) | 6 | 24 | 16 | 15.5 | 16.5 | 13.5 | 14.5 | 16.5 | deceased at age 2 | 16.5 | 19.5 | 15 | 3.5 | 4.5 |
| Clinical characteristics | | | | | | | | | | | | | | |
| Age of first convulsion (months) | 13 | 12 | 9 | 12 | 6.5 | 42 | 9 | 9 | birth | 10 | 9 | 9 | 9 | 11 |
| Seizure intensity (b) | + | +++ | ++ | + | +++ | + | +++ | + | +++ | ++ | +++ | + | + | +++ |
| Resistance to therapy (c) | + | +++ | ++ | + | +++ | + | +++ | + | +++ | ++ | +++ | + | + | +++ |
| Amelogenesis imperfecta | yes | yes | yes | yes | yes | yes | yes | yes | N/A | yes | yes | yes | yes | yes |
| Intellectual impairment (d) | +++ | ++++ | +++ | +++ | +++ | ++ | +++ | ++ | N/A | +++ | ++++ | ++ | +++ | +++ |
| Speech | no | no | no | no | no | mumbling | no | mumbling | N/A | no | no | mumbling | no | no |
| Ambulant | yes | no | yes | yes | yes | yes | no | yes | N/A | yes | no | yes | yes | yes |
| Laboratory evaluation | | | | | | | | | | | | | | |
| EEG (e) | N | N/A | N | N/A | abnormal | abnormal | N/A | N/A | N/A | abnormal | N/A | N | abnormal | N |
| Brain MRI (f) | abnormal | abnormal | abnormal | N/A | N | N/A | N/A | N/A | N/A | abnormal | N/A | N/A | N | N |

The following abbreviations were used: NVD, normal vaginal delivery; CS, cesarean section; N/A, not available; EEG, electroencephalogram; N, normal; and MRI, magnetic resonance imaging.

(a) All individuals were born at term.

(b) Seizure intensity: +, occasional; ++, frequent; and +++, severe.

(c) Therapy resistance: +, good response to treatment; ++, control by treatment; and +++, refractory.

(d) Intellectual impairment: +, mild; ++, moderate; +++, severe; and ++++, profound.

(e) An abnormal EEG revealed a pattern of slow, short epileptiform spikes and waves during light sleep and was normal thereafter.

(f) An abnormal MRI revealed dilation of cerebellar sulci and third and lateral ventricles.

and he is hyperactive, nonverbal, relatively ambulant, and

mentally disabled.

A wide interfamilial variability was again noted in

families 1, 2, and 3. Family 2 has three affected children

(IV-1, IV-2, and IV-4). The first affected boy (IV-2), who

was hypotonic and sucked poorly as a neonate, experienced

intractable seizures beginning at 6.5 months of age. The

youngest boy (IV-4) was considered normal until, at the

age of three years, he experienced a convulsive episode.

Accordingly, IV-4 performs better than his older siblings.

Affected children from family 1 (II-1, II-6, and II-7) and

family 3 (V-1 and V-3) all presented with seizures at around

one year of age (between 9 and 13 months). The severity of

the epileptic events ranged from short convulsive episodes

that resolved under antiepileptic treatment (such as with

individuals II-1 and V-3) to generalized convulsions

partially controlled by anticonvulsants (II-7) to frequent

convulsive episodes resistant to therapy (II-6 and V-1).

Intellectual disability was noted to correlate with the

severity of the epileptic events. The phenotype displayed

by our KTS cases is consistent but variable. All cases dis-

played amelogenesis imperfecta, seizures, severe develop-

mental delay, and lack of speech. The magnitude of the

intellectual disability, although severe to profound, is still

directly related to the severity of the convulsive disorder

as manifested by age of onset and response to treatment.

Affected individuals lost their gross motor skills in either

early or late adolescence because they became spastic. The

most afflicted individuals were bedridden early in life. Metabolic

workup was negative as a rule. Electroencephalograms

(EEGs) undertaken shortly after a convulsive episode were

interpreted, in most cases, as normal. Magnetic resonance

images (MRIs) were available for seven individuals. In four

of these individuals, dilatation of cerebellar sulci and third

and lateral ventricles was evident and should probably be

regarded as a correlate of cerebral atrophy.

With this in mind, the term ‘‘dementia’’ previously used

for defining a major characteristic of KTS is not valid for

the cases described here. KTS-affected children have global

developmental delay that is cortical in nature (intellectual

disability, spasticity, and seizures), and the progressive

nature of their neurological decline might be attributed

to, among other things, the intractability of their seizures

and, as such, epileptic encephalopathy.

Homozygosity mapping was carried out with the

assumptions that the disease follows a recessive model

and that a single responsible allele is shared by all affected

individuals. First, genomic DNA from ten affected individ-

uals, one healthy sibling, and one obligate carrier parent

was genotyped with the Illumina 6000 SNP array (BioRap

Technologies at the Rappaport Institute); thereafter, five

affected individuals were genotyped with the Affymetrix

GeneChip Human Mapping 250K Nsp microarray (Biolog-

ical Services Unit at the Weizmann Institute). Homozy-

gosity-by-descent analysis was carried out manually and

explored identical homozygous intervals in all affected

individuals. Candidate homozygous loci were genotyped

with microsatellite markers derived from Marshfield

maps or with markers designed by us (our markers were

based on the Tandem Repeats Finder program and the

UCSC Human Genome Database). Haplotypes of family

members were manually constructed and analyzed with

SUPERLINK online. In the 6000 SNP array, the analysis

identified a candidate segment, in chromosomal region

16p13.3, spanning 2,670,304 bp between rs2075852

and rs1012259; only one SNP (rs85930) was

homozygous in all ten affected individuals,

whereas the healthy mother and sibling were heterozy-

gous. Evaluating this interval in the 250K SNP array,

we identified a homozygous segment spanning only

873,963 bp between rs11865087 and rs3760030; this

segment was shared by four of the five genotyped individ-

uals. The fifth affected individual was clinically reevaluated

and excluded from the research as a non-KTS case (data

not shown). We developed and genotyped the following

three microsatellite markers in this interval: GT32 at

chr16: 4,533,109–4,533,282 (marker 1); TG19 at chr16:

4,854,835–4,855,072 (marker 2); and GT31 at chr16:

5,105,840–5,106,041 (marker 3). We depict haplotypes in

Figure 1 to show the segregation of markers and SNPs in

affected individuals and healthy family members; there

are three affected and seven healthy individuals in family

1, three affected and two healthy individuals in family 2,

and six affected and eight healthy individuals in families

3–5 (Figure 1A). Individual V-4 (family 4), who died at

2 years of age, and individual V-8 (family 4), who had not

yet been diagnosed with KTS at that stage, were excluded

from the analysis. All affected individuals shared the

same homozygous haplotype for markers 2 and 3, and

a 586,513 bp segment between microsatellite marker 1

(chr16: 4,533,109) and rs3760030 (chr16: 5,119,872) was

defined as the linkage locus for KTS (Figure 1A). Under

a recessive model of full penetrance, the LOD score for

linkage of this region to KTS was 6.4.
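
As a rough illustration of the homozygosity-by-descent search described above, the following Python sketch scans SNP calls and reports runs of consecutive markers at which every affected individual is homozygous for the same allele. The function name, the two-letter genotype encoding, the "00" code for a missing call, and the toy data are all assumptions made for this example. They are not taken from the study, in which the analysis was carried out manually and linkage was scored with SUPERLINK.

```python
from typing import Dict, List, Tuple


def shared_homozygous_runs(
    positions: List[int], genotypes: Dict[str, List[str]]
) -> List[Tuple[int, int, int]]:
    """Find runs of consecutive SNPs where all individuals share one homozygous call.

    positions: SNP coordinates in bp, sorted ascending (hypothetical input format).
    genotypes: individual ID -> list of two-letter calls such as "AA" or "AG",
               in the same order as positions; "00" marks a missing call.
    Returns (start_bp, end_bp, n_snps) for each run.
    """
    runs = []
    start = end = None
    count = 0
    for i, pos in enumerate(positions):
        # Collect the distinct calls seen at this SNP across all individuals.
        calls = {person_calls[i] for person_calls in genotypes.values()}
        call = next(iter(calls))
        # Shared only if everyone has the same call, it is homozygous, and not missing.
        shared = len(calls) == 1 and call[0] == call[1] and call[0] != "0"
        if shared:
            if start is None:
                start, count = pos, 0
            end = pos
            count += 1
        elif start is not None:
            runs.append((start, end, count))
            start = None
    if start is not None:
        runs.append((start, end, count))
    return runs


# Toy data: three affected individuals, five SNPs (illustrative values only).
toy_positions = [100, 200, 300, 400, 500]
toy_genotypes = {
    "affected_1": ["AA", "GG", "CC", "AG", "TT"],
    "affected_2": ["AA", "GG", "CC", "GG", "TT"],
    "affected_3": ["AG", "GG", "CC", "GG", "TT"],
}
print(shared_homozygous_runs(toy_positions, toy_genotypes))
```

In a real analysis, very short runs would normally be discarded, genotyping errors would need to be tolerated, and the surviving interval would be refined with denser markers, much as the authors did with the 250K array and the microsatellite markers.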

The linked interval in chromosomal region 16p13.3 harbors 17 genes. We

performed Sanger sequencing of coding regions and flank-

ing intron-exon boundaries on genomic DNA from one

affected individual (II-1 in family 1) by using the primers

that we designed. The genomic sequences were retrieved

from the UCSC Genome Browser (GRCh37/hg19 assembly

[Feb. 2009]). In the first ten genes, no potentially

damaging variants were found. In contrast, sequencing

of the Drosophila rogdi homolog, ROGDI (FLJ22386) (RefSeq

accession number NM_024589.1), yielded a homozygous

nonsense mutation (c.469C>T in exon 7) causing a premature

stop codon, p.Arg157* (Figure 1B). This nonsense

mutation cosegregated with KTS in all five families; all

affected individuals were homozygous for the mutation,

all parents were heterozygous, and unaffected siblings

were either heterozygous or homozygous for the wild-

type allele. Of 100 unaffected adults from the same Druze

community, ten were heterozygous carriers of the muta-

tion; none were homozygous for the mutation. The muta-

tion was not observed in 100 Druze individuals from other

Figure 1. Families 1–5 Haplotypes and the c.469C>T Mutation in Exon 7 of ROGDI

(A) Disease-associated haplotypes are shown in boxes. Markers 1 and rs3760030 define the minimal homozygosity locus associated with the disease (allele 0: not genotyped). The numbers flanking the genotyped markers indicate the distance from 16pter (GRCh37/hg19 assembly).

(B) The ROGDI c.469C>T (p.Arg157*) mutation in genomic DNA of a KTS-affected individual compared to a control.

parts of Israel. Primer sequences and PCR conditions are

available upon request. The c.469C>T mutation was not

reported in the Exome Variant Server. No clinical links

were reported for SNPs in the gene.
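
The carrier count reported above (10 heterozygotes among 100 unaffected adults from the same community) can be converted into an approximate allele frequency and an expected frequency of homozygotes. The Python sketch below does this arithmetic. The function name is hypothetical, and the inbreeding coefficient F = 1/16 (the value for offspring of first cousins) is an assumed figure used only to show how consanguinity raises the expectation; the source does not report an inbreeding coefficient.

```python
def carrier_summary(n_sampled: int, n_carriers: int, inbreeding_f: float = 0.0):
    """Estimate allele frequency and expected homozygote frequency from carrier counts.

    Assumes the sample contains no homozygous-mutant individuals, so every
    mutant allele sits in a heterozygote. inbreeding_f is a hypothetical
    inbreeding coefficient; 0.0 gives plain Hardy-Weinberg proportions.
    """
    carrier_freq = n_carriers / n_sampled
    q = n_carriers / (2 * n_sampled)  # mutant allele frequency
    p = 1.0 - q
    expected_hom = q * q + inbreeding_f * p * q  # q^2 + F*p*q
    return carrier_freq, q, expected_hom


carrier_freq, q, hom_random = carrier_summary(100, 10)
_, _, hom_cousins = carrier_summary(100, 10, inbreeding_f=1 / 16)

print(f"carrier frequency: {carrier_freq:.3f}")
print(f"allele frequency:  {q:.3f}")
print(f"expected homozygotes, random mating:      {hom_random:.5f}")
print(f"expected homozygotes, F = 1/16 (assumed): {hom_cousins:.5f}")
```

Under random mating the expected homozygote frequency is about 0.0025, or roughly 1 in 400 births; with the assumed F it roughly doubles. These are back-of-the-envelope figures from a sample of 100 and carry wide uncertainty.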

ROGDI (FLJ22386) is the human homolog of Drosophila

rogdi. The human gene encodes a 287 amino acid leucine

zipper protein of unknown function. Drosophila rogdi

encodes 343 and 268 amino acid isoforms. The stop at

Figure 2. Subcellular Localization of ROGDI in Transfected HEK 293 Cells and Blood Mononuclear Cells

(A–F) The immunostaining of ROGDI-transfected HEK 293 cells was performed with a ROGDI polyclonal antibody incubated in the presence of a permeabilizing reagent (0.2% saponin). ROGDI labeling (A) was revealed with an Alexa Fluor-488 goat anti-rabbit secondary antibody (green). Nuclei (B) were stained with DAPI (blue). The merged picture (C) shows the ROGDI-antibody staining (green) together with nuclei staining (blue). As a control, the same experiment was performed with the secondary antibody in the absence of ROGDI antibody (D, E, and F).

(G–R) The same ROGDI antibody was used for immunolocalization of native ROGDI in blood mononuclear cells (green signal in G, K, and O). Colabeling was performed with a LAMIN A monoclonal antibody (red signal in H, L, and P). Nuclei (I, M, and Q) were stained with DAPI (blue). The partial colocalization of ROGDI with LAMIN A is shown in (J), (N), and (R). Cells were observed by confocal microscopy. Three cell sections from the middle to the top of cells are shown (G–J, K–N, and O–R, respectively). White scale bars represent 10 μm.

residue 157 of the human homolog corresponds to residue

156 of the shorter isoform and residue 231 of the longer

isoform in Drosophila. The most conserved domain of the

protein is the C terminus (residues 253–281 of the human

protein), shared by the human protein and both Drosophila

isoforms and truncated by the stop mutation in the KTS-

affected families. Meta-analysis of genome-wide expres-

sion studies indicates that in both humans and mice,

ROGDI is expressed more in the hippocampus than in

other tissues.10 Our RT-PCR analysis indicates that ROGDI

is widely expressed and has higher levels in the adult brain,

spinal cord, peripheral blood, heart, and bone marrow but

lower (and still detectable) levels in many other tissues,

including the fetal brain (Figure S2).
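
The correspondence between the cDNA change c.469C>T and the protein change p.Arg157* follows from simple codon arithmetic, sketched here in Python. The helper names are hypothetical. The source does not state the reference codon, so the last step only enumerates which arginine codons of the standard genetic code would become a stop codon through a C>T change at that position.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
ARG_CODONS = ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"]  # standard genetic code


def codon_of(cds_position: int):
    """Map a 1-based coding-sequence position to (codon number, offset 0-2)."""
    index = cds_position - 1
    return index // 3 + 1, index % 3


def substitute(codon: str, offset: int, alt: str) -> str:
    """Return the codon with the base at the given offset replaced by alt."""
    return codon[:offset] + alt + codon[offset + 1:]


codon_number, offset = codon_of(469)

# Arginine codons that carry a C at this offset and become a stop after C>T.
arg_to_stop = [
    codon
    for codon in ARG_CODONS
    if codon[offset] == "C" and substitute(codon, offset, "T") in STOP_CODONS
]

print(codon_number, offset)  # codon 157, first base of the codon (offset 0)
print(arg_to_stop)           # ['CGA']
```

Position 469 is the first base of codon 157, and CGA is the only arginine codon that a first-position C>T change converts to a stop (TGA). The genomic description of the same variant, chr16: 4,848,632 G>A, is the complement of C>T, which is what one would expect if the gene is transcribed from the reverse strand of the reference assembly; the source does not state the strand explicitly.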

Examination of the primary sequence of ROGDI does not

provide any clues about the subcellular localization of the

protein. To address this issue, we generated an expression

plasmid encoding wild-type ROGDI (pROGDI) after ampli-

fication of the full-length ROGDI cDNA by PCR with cDNA

from an adult human brain as a template (Clontech).

Cellular localization of ROGDI was thereafter evaluated

in three experiments. Human embryonic kidney (HEK)

293 cells were transfected with pROGDI and were

examined by indirect immunofluorescence microscopy with

a rabbit anti-ROGDI polyclonal antibody (1 mg/ml rabbit

anti-ROGDI polyclonal antibody, Protein Tech Group,

Chicago, IL, USA; Catalog No. 17047-1-AP) generated

against the entire protein and a 1:1,000 dilution of

secondary Alexa Fluor-488 (green) goat anti-rabbit anti-

body (Molecular Probes); the HEK 293 cells revealed

a strong nuclear labeling of multiple bright spots and virtu-

ally no cytoplasmic staining (Figures 2A–2F). Blood mono-

nuclear cells—treated with the same antibodies and

a 1:1,000 dilution of a mouse monoclonal LAMIN A

antibody (Abcam, Cambridge, UK) and conjugated with

a secondary Alexa Fluor-594 (red) goat anti-mouse anti-

body (Molecular Probes)—were counterstained with DAPI

and examined by confocal microscopy (Nikon D-Eclipse

C1 with EZ-C1 3.91 software) (Figures 2G–2R). Native

ROGDI colocalized with the nuclear envelope marker

LAMIN A (Figures 2J, 2N, and 2R), suggesting again that

ROGDI might belong to the nuclear envelope. The same

procedure was undertaken for the labeling of dermal fibro-

blasts cultured from individual II-1 and the control, except

that in this case, the ROGDI and LAMIN A antibodies

were conjugated with secondary Alexa Fluor-594 (red)

and Alexa Fluor-488 (green) antibodies (Molecular Probes),

respectively. As shown in Figures 3A–3C, this experiment

revealed that in dermal fibroblasts, ROGDI localizes to

the nucleus, and a strong labeling of the nuclear envelope

is associated with faint spots within the nucleus. Most

importantly, ROGDI was not detected in the fibroblasts

from affected individual II-1 (Figures 3D–3F). Consistent

with these data, the protein was also not detected by

immunoblotting (Figure 3G). These latter results confirm

the specificity of the labeling obtained with the ROGDI

antibody used in these experiments and are consistent

with the loss-of-function mutation (p.Arg157*) identified

in individuals with KTS.

ROGDI emerges as a new player in neurogenesis. The

expression pattern of the gene, showing strong expression

in the adult brain and spinal cord, is in line with the

disease characteristics relevant to cortical dysfunction

and spasticity. However, its detectable expression in

many other sites that do not seem to be affected by the

disease raises the question of its physiological role in these

tissues. ROGDI (FLJ22386) has been reported to interact

with a protein called disrupted in schizophrenia 1 (DISC1)

(MIM 605210) in yeast two-hybrid screens.11 DISC1 is

deemed necessary for neuronal proliferation, the migra-

tion of cortical interneurons, and their proper differentia-

tion in the cerebral cortex.12 A plausible interaction

between ROGDI and DISC1, if confirmed by appropriate

experiments such as coimmunoprecipitation studies

in native cells, could offer a clue regarding the role of

ROGDI in the pathophysiology of KTS.

In summary, 14 KTS-affected individuals from a consan-

guineous Druze community share homozygosity for

a nonsense mutation in the human homolog of Drosophila

rogdi, which encodes a leucine zipper protein of unknown

function. The nonsense mutation would truncate a

highly conserved C-terminal domain. Mammalian ROGDI

is highly expressed in the hippocampus, and the

Figure 3. Immunostaining in Dermal Fibroblasts and Immunoblot Analysis from a KTS-Affected Individual and a Control

(A–F) Double immunostaining of LAMIN A and ROGDI in control (A–C) and KTS (D–F) fibroblasts (II-1 in family 1). LAMIN A (green) and ROGDI (red) were labeled with specific antibodies. Merged images are shown in (C) and (F).

(G) Immunoblot shows absence of ROGDI in Epstein-Barr virus (EBV)-transformed lymphoblasts from affected individuals compared to controls. Lanes 1 and 2 show EBV-transformed lymphoblasts of controls. Lanes 3 and 4 display EBV-transformed lymphoblasts from affected individuals (II-1 in family 1; V-8 in family 4). The upper bands indicate a molecular weight of ~32 kDa. Tubulin was used as a loading control.

corresponding protein localizes to the nuclear envelope.

Age of onset and the severity of the degenerative KTS

phenotype vary considerably among individuals who are

homozygous for the same disease allele and who are

from the same small community.

Supplemental Data

Supplemental Data include two figures and can be found with this

article online at http://www.cell.com/AJHG.

Received: February 24, 2012

Revised: March 13, 2012

Accepted: March 15, 2012

Published online: April 5, 2012

Web Resources

The URLs for data presented herein are as follows:

BLAST, http://blast.ncbi.nlm.nih.gov

GenBank, http://www.ncbi.nlm.nih.gov/nuccore/nm_024589.1

GeneCards, http://www.genecards.org/

Mutalyzer, http://www.mutalyzer.nl/2.0

Online Mendelian Inheritance in Man (OMIM), http://www.omim.org

PRIMER3, http://frodo.wi.mit.edu/primer3/

SNP Database, http://www.ncbi.nlm.nih.gov/snp/

SuperLink, http://bioinfo.cs.technion.ac.il/superlink-online/

Tandem Repeat Finder, http://tandem.bu.edu/trf/trf.html

UCSC, http://genome.ucsc.edu

References

1. Kohlschutter, A., Chappuis, D., Meier, C., Tonz, O., Vassella, F.,

and Herschkowitz, N. (1974). Familial epilepsy and yellow

teeth—a disease of the CNS associated with enamel hypo-

plasia. Helv. Paediatr. Acta 29, 283–294.

2. Witkop, C.J., Jr., and Sauk, J.J., Jr. (1976). Heritable defects

of enamel. In Oral Facial Genetics, R.E. Stewart and G.H.

Prescott, eds. (St. Louis: C.V. Mosby), pp. 200–202.

3. Christodoulou, J., Hall, R.K., Menahem, S., Hopkins, I.J., and

Rogers, J.G. (1988). A syndrome of epilepsy, dementia, and

amelogenesis imperfecta: Genetic and clinical features. J.

Med. Genet. 25, 827–830.

4. Zlotogora, J., Fuks, A., Borochowitz, Z., and Tal, Y. (1993).

Kohlschutter-Tonz syndrome: epilepsy, dementia, and amelo-

genesis imperfecta. Am. J. Med. Genet. 46, 453–454.

5. Petermoller, M., Kunze, J., and Gross-Selbeck, G. (1993). Kohl-

schutter syndrome: Syndrome of epilepsy—dementia—ame-

logenesis imperfecta. Neuropediatrics 24, 337–338.

6. Musumeci, S.A., Elia, M., Ferri, R., Romano, C., Scuderi, C.,

and Del Gracco, S. (1995). A further family with epilepsy,

dementia and yellow teeth: The Kohlschutter syndrome. Brain

Dev. 17, 133–138, discussion 142–143.

7. Wygold, T., Kurlemann, G., and Schuierer, G. (1996). Kohl-

schutter syndrome—an example of a rare progressive neuroec-

todermal disease. Case report and review of the literature. Klin.

Padiatr. 208, 271–275.

8. Donnai, D., Tomlin, P.I., and Winter, R.M. (2005). Kohlschut-

ter syndrome in siblings. Clin. Dysmorphol. 14, 123–126.

9. Haberlandt, E., Svejda, C., Felber, S., Baumgartner, S., Gunther,

B., Utermann, G., and Kotzot, D. (2006). Yellow teeth,

seizures, and mental retardation: A less severe case of Kohl-

schutter-Tonz syndrome. Am. J. Med. Genet. A. 140, 281–283.

10. Kapushesky, M., Adamusiak, T., Burdett, T., Culhane, A.,

Farne, A., Filippov, A., Holloway, E., Klebanov, A., Kryvych,

N., Kurbatova, N., et al. (2012). Gene Expression Atlas

update—a value-added database of microarray and

sequencing-based functional genomics experiments. Nucleic

Acids Res. 40 (Database issue), D1077–D1081.

11. Camargo, L.M., Collura, V., Rain, J.C., Mizuguchi, K., Hermja-

kob, H., Kerrien, S., Bonnert, T.P., Whiting, P.J., and Brandon,

N.J. (2007). Disrupted in Schizophrenia 1 Interactome:

Evidence for the close connectivity of risk genes and a

potential synaptic basis for schizophrenia. Mol. Psychiatry

12, 74–86.

12. Kamiya, A., Kubo, K.I., Tomoda, T., Takaki, M., Youn, R.,

Ozeki, Y., Sawamura, N., Park, U., Kudo, C., Okawa, M.,

et al. (2005). A schizophrenia-associated mutation of DISC1

perturbs cerebral cortex development. Nat. Cell Biol. 7,

1167–1178.

REPORT

Maternal Inheritance of a Promoter Variant in the Imprinted PHLDA2 Gene Significantly Increases Birth Weight

Miho Ishida,1 David Monk,1,5 Andrew J. Duncan,1 Sayeda Abu-Amero,1 Jiehan Chong,1,6

Susan M. Ring,2 Marcus E. Pembrey,1,2 Peter C. Hindmarsh,1 John C. Whittaker,3 Philip Stanier,4

and Gudrun E. Moore1,*

Birth weight is an important indicator of both perinatal and adult health, but little is known about the genetic factors contributing to its

variability. Intrauterine growth restriction is a leading cause of perinatal morbidity and mortality and is also associated with adult

disease. A significant correlation has been reported between lower birth weight and increased expression of the maternal PHLDA2 allele

in term placenta (the normal imprinting pattern was maintained). However, a mechanism that explains the transcriptional regulation of

PHLDA2 on in utero growth has yet to be described. In this study, we sequenced the PHLDA2 promoter region in 263 fetal DNA samples

to identify polymorphic variants. We used a luciferase reporter assay to identify in the PHLDA2 promoter a 15 bp repeat sequence (RS1)

variant that significantly reduces PHLDA2-promoter efficiency. RS1 genotyping was then performed in three independent white European

normal birth cohorts. Meta-analysis of all three (total n = 9,433) showed that maternal inheritance of RS1 resulted in a significant

93 g increase in birth weight (p = 0.01; 95% confidence interval [CI] = 22–163). Moreover, when the mother was homozygous for RS1,

the influence on birth weight was 155 g (p = 0.04; 95% CI = 9–300), which is a similar magnitude to the reduction in birth weight caused

by maternal smoking.
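
The effect sizes, confidence intervals, and p values quoted above can be checked against one another with a normal approximation. The Python sketch below backs out the standard error from each 95% CI and recomputes a two-sided p value. It assumes symmetric Wald-type intervals (estimate plus or minus 1.96 standard errors), which is an assumption of this example; the summary above does not say how the intervals or p values were derived, and the function name is hypothetical.

```python
import math


def normal_check(estimate: float, ci_low: float, ci_high: float, z_crit: float = 1.96):
    """Back out the standard error, z statistic, and two-sided p value from a CI.

    Assumes the interval is a symmetric Wald interval, estimate +/- z_crit * SE,
    which is an assumption of this sketch rather than a stated method.
    """
    se = (ci_high - ci_low) / (2 * z_crit)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return se, z, p


for label, est, low, high in [
    ("maternal inheritance of RS1", 93, 22, 163),
    ("mother homozygous for RS1", 155, 9, 300),
]:
    se, z, p = normal_check(est, low, high)
    print(f"{label}: SE = {se:.1f} g, z = {z:.2f}, p = {p:.3f}")
```

Under these assumptions the recomputed p values round to the reported 0.01 and 0.04, so the quoted estimates and intervals are mutually consistent.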

Very low birth weight shows a strong association with

perinatal mortality and morbidity and is linked to an

increased risk of developing adulthood diseases, such as

obesity and type 2 diabetes (MIM 125853).1,2 Fetal growth

relies on an effective nutrient supply from the mother to

the fetus via the placenta; this nutrient supply is in-

fluenced by a complex interrelationship between the

environment and genetics. Of particular interest are

imprinted genes, which show expression from only one

allele in a parent-of-origin dependent manner. Genomic

imprinting is found almost exclusively in placental

mammals. Its evolution is probably best explained by

the ‘‘conflict hypothesis,’’ which suggests that paternally

expressed imprinted genes promote fetal growth and

ensure inheritance of the paternal genome to successive

generations, whereas maternally expressed imprinted

genes limit growth in order for the mother to survive

and reproduce again.3

PHLDA2 (MIM 602131) encodes the pleckstrin

homology-like domain, family A, member 2 protein and

is a maternally expressed imprinted gene found in one of

the most extensively studied imprinting clusters in human

chromosomal region 11p15.5. Consistent with the

‘‘conflict hypothesis,’’ Phlda2-null mice exhibit placenta

overgrowth, whereas doubling the Phlda2 expression in

transgenic mice results in placental stunting accompanied

by a 13% reduction in fetal weight; both of these findings

suggest that Phlda2 has a growth-suppressing role.4,5 In hu-

mans, PHLDA2 is expressed in a variety of tissues but is

predominantly expressed in the villous cytotrophoblast

of the placenta throughout gestation,6,7 and upregulation

has been observed in intrauterine growth restriction

(IUGR) placentas.8–10 This complements our previous

finding that PHLDA2 expression is significantly higher in

the term placenta of lower-birth-weight babies.11 However,

sequence analysis of all informative samples in the ‘‘Moore

cohort’’ of white European normal births confirmed that

only maternal, monoallelic PHLDA2 expression was

present.11 This indicates that loss of imprinting (LOI) was

not responsible for the increased PHLDA2 expression and

suggests that additional regulatory mechanisms, including

the PHLDA2 promoter, other than imprinting must be

involved.

In this study, we examined the PHLDA2 promoter region

for genetic polymorphisms that might affect PHLDA2 tran-

scriptional activity and therefore could affect birth weight.

From the Moore cohort (n = 263), recruited from Queen

Charlotte and Chelsea Hospital,11 we sequenced a ~2 kb

upstream region beginning at the transcription start site

and overlapping the promoter CpG island. The UCSC

Genome Browser (build GRCh37/hg19) listed 20 SNPs,

encompassing rs12798267 to rs412300, in this region.

1Clinical and Molecular Genetics Unit, Institute of Child Health, University College London, London WC1N 1EH, UK; 2Avon Longitudinal Study of Parents and Children, Department of Social Medicine, Oakfield House, Oakfield Grove, University of Bristol, Bristol BS8 2BN, UK; 3Noncommunicable Disease Epidemiology Unit, London School of Hygiene and Tropical Medicine, University of London, London WC1E 7HT, UK; 4Neural Development Unit, Institute of Child Health, University College London, London WC1N 1EH, UK

5Present address: Imprinting and Cancer Group, Epigenetics and Cancer Biology Program, Bellvitge Institute for Biomedical Research, L’Hospitalet de Llobregat, Barcelona 08907, Spain

6Present address: Ipswich Hospital NHS Trust, Ipswich, IP4 5PD, UK

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.021. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 715–719, April 6, 2012 715

However, none of these SNPs were identified in this cohort,

suggesting that they either are rare in the white European

population or have not been accurately validated. We only

detected one variable sequence: a tandem 15 bp (5′-GGGGCGGGGAGGGGC-3′;

bp 4,934–4,967 of NG_009266.1)

repeat sequence (RS) variant present 48 bp upstream of

the PHLDA2 transcription start site (Figure S1, available

online). The tandem repeat (RS2) is most common (it is

present in 87% of chromosomes), and the minor allele is

a single copy (RS1) that is found in the remaining 13% of

chromosomes (Figure 1). In addition, RS1 was not found

to be in linkage disequilibrium (LD) with nearby SNPs

rs3847646 (located ~3 kb upstream) or rs13390 or

rs1056819 (present within PHLDA2 exons 1 and 2,

respectively).
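The allele frequencies quoted above, and the Hardy-Weinberg statement in the footnote of Table 1, can be checked directly from the genotype counts reported in Table 1. The following is a minimal sketch of such a check, assuming a Pearson chi-square test with one degree of freedom; the article does not state which test the authors used, and the function names are illustrative only.

```python
import math

# Genotype counts copied from Table 1: (RS2/RS2, RS2/RS1, RS1/RS1)
COUNTS = {
    "Moore": (193, 66, 4),
    "UCL-FGS": (292, 88, 5),
    "ALSPAC": (6649, 2008, 128),
    "Combined": (7134, 2162, 137),
}

def rs1_frequency(n22, n21, n11):
    """Fraction of chromosomes carrying the single-copy RS1 allele."""
    return (2 * n11 + n21) / (2 * (n22 + n21 + n11))

def hwe_chi2(n22, n21, n11):
    """Pearson chi-square (1 df) against Hardy-Weinberg proportions.

    Assumption: this is one common way to test HWE, not necessarily
    the procedure used in the article.
    """
    n = n22 + n21 + n11
    q = rs1_frequency(n22, n21, n11)  # RS1 allele frequency
    p = 1.0 - q                       # RS2 allele frequency
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n22, n21, n11)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival, 1 df
    return chi2, p_value

for cohort, counts in COUNTS.items():
    chi2, p_value = hwe_chi2(*counts)
    freq = rs1_frequency(*counts)
    print(f"{cohort}: RS1 freq={freq:.3f}, chi2={chi2:.2f}, p={p_value:.2f}")
```

On the combined counts, the RS1 frequency comes out close to the 13% quoted in the text.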

We investigated the effect of the PHLDA2 RS on the

gene’s promoter activity by transiently transfecting lucif-

erase reporter constructs into the transformed human

embryonic kidney (HEK) 293T cell line and the human

trophoblast cell line 1 (TCL-1). We made the promoter

constructs by cloning 300 bp (with the use of HindIII

and XhoI) and 600 bp (with the use of HindIII and SacI)

DNA fragments upstream of the PHLDA2 start site and in-

serting them into the pGL3.1-Basic vector (Promega, UK).

These sequences contained either RS1 or RS2. RS1 showed

significantly lower PHLDA2 promoter activity for both the

300 bp (74% decrease; t test, p = 0.004) and 600 bp (42%

decrease; t test, p = 0.001) constructs (Figure 2). The experiments

were performed in duplicate for HEK293T cells and

in triplicate for TCL-1 cells (Figure S2), and each assay

included six replicates. TFSEARCH (Transcriptional Factor

Search) shows that the RS2 allele potentially harbors four

SP1 and two MZF1 binding sites. However, losing a 15 bp

copy (RS1) removes three of these sites, suggesting the

possibility that the number of available transcription-

factor binding sites might be important for promoter

efficiency.
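Because RS1 and RS2 differ only in the number of tandem copies of the 15 bp unit, calling the allele from a promoter read reduces to counting back-to-back copies of that unit. A minimal sketch follows; the flanking bases are hypothetical placeholders rather than the PHLDA2 reference sequence, and the function names are illustrative.

```python
import re

UNIT = "GGGGCGGGGAGGGGC"  # 15 bp repeat unit reported in the text

def tandem_copies(sequence, unit=UNIT):
    """Return the largest number of back-to-back copies of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = pattern.findall(sequence.upper())
    return max((len(run) // len(unit) for run in runs), default=0)

def call_allele(sequence):
    """Map a copy number to the allele names used in the text."""
    return {1: "RS1", 2: "RS2"}.get(tandem_copies(sequence), "other")

# Hypothetical flanking bases, used only to make the example self-contained.
LEFT, RIGHT = "ATTCA", "TTGAC"
print(call_allele(LEFT + UNIT + RIGHT))      # single copy
print(call_allele(LEFT + UNIT * 2 + RIGHT))  # tandem duplication
```

Copy numbers other than one or two fall into an "other" class, in line with the rare additional alleles noted in the footnote of Table 1.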

Given that high expression of PHLDA2 is associated with

lower birth weight11 and that RS1 reduces the PHLDA2

Figure 1. Sequence Electropherograms Showing the 15 bp RS in the PHLDA2 Promoter Region

RS1/RS1 homozygous, RS2/RS2 homozygous, and RS1/RS2 heterozygous sequences are shown. Each black bar represents the location of a single 15 bp copy. The start of the overlapping sequence in the heterozygous sample is indicated by the black arrow and dotted bars.

promoter efficiency in vitro, we investi-

gated whether RS1 could be associated

with increased birth weight. Because

PHLDA2 is maternally expressed and pater-

nally silenced, RS1 homozygotes and

heterozygotes with a maternally inherited

RS1 were grouped and deemed the ‘‘RS1

effect group.’’ RS2 homozygotes and hetero-

zygotes with a paternally inherited RS1 were named the

‘‘unaffected group’’ (Table S1). We assessed the parental

origin of the RS1 allele in the heterozygous babies by

genotyping their corresponding parental DNA samples.

Uninformative cases were those in which both parents

were heterozygotes.
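The grouping rule described above can be written as a small decision function. This is a sketch of the logic as stated in the text, not the authors' code: genotypes are coded here as the number of RS1 copies (an assumption made for illustration), and the father's genotype is optional to cover mother-child pairs.

```python
def rs1_origin_group(baby, mother, father=None):
    """Classify a baby by the parental origin of RS1.

    Genotypes are coded as the number of RS1 copies (0, 1 or 2).
    Returns "RS1 effect", "unaffected" or "uninformative".
    Only heterozygous babies are checked for Mendelian consistency.
    """
    known = (baby, mother) if father is None else (baby, mother, father)
    if any(g not in (0, 1, 2) for g in known):
        raise ValueError("genotypes must be 0, 1 or 2 RS1 copies")
    if baby == 2:
        return "RS1 effect"   # one of the two copies must be maternal
    if baby == 0:
        return "unaffected"   # no RS1 allele at all
    # Heterozygous baby: work out which parent transmitted RS1.
    maternal = mother == 2 or father == 0
    paternal = mother == 0 or father == 2
    if maternal and paternal:
        raise ValueError("parental genotypes are inconsistent with the baby")
    if maternal:
        return "RS1 effect"
    if paternal:
        return "unaffected"
    return "uninformative"    # origin cannot be resolved from these genotypes

print(rs1_origin_group(1, 1, 0))  # father has no RS1, so it came from the mother
print(rs1_origin_group(1, 1, 1))  # both parents heterozygous
print(rs1_origin_group(1, 2))     # mother-child pair, mother RS1/RS1
```

Without paternal genotypes, a heterozygous baby is informative only when the mother is homozygous.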

We first genotyped the parental DNA samples corre-

sponding to the heterozygous babies in the Moore cohort,

and 28 babies were revealed to have maternally inherited

RS1 (Table 1). To investigate the effect of maternally in-

herited RS1 on birth weight, we applied a linear-regression

model and corrected for the following covariates: the

gender of the baby, maternal weight, gestational age, parity,

and maternal diabetes, hypertension, and smoking habits

(Table S2). A two-tailed test was used throughout; p values

were based on Wald tests, and standard residual plots were

examined and showed no evidence of departure from

model assumptions. This analysis showed that babies in

the RS1 effect group tended to have an average birth

weight 122 g higher than that of the babies in the unaffected

group (p = 0.15; 95% confidence interval [CI] = −43–286) (Table 1). We then carried out the same analysis

on the UCL-FGS cohort (baby-parent trios, n = 385) from

the University College London Fetal Growth Study.12

This produced a similar trend to the Moore cohort: babies

in the RS1 effect group (n = 16) were on average 68 g

heavier (p = 0.61; 95% CI = −196–332) (Table 1). The

reproducibility of this trend suggests a potentially valid

finding, although statistical significance was not

reached because too few individuals (approximately

13%) had maternal RS1 (Table 1).
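The relationship between the reported effect estimates, confidence intervals, and Wald p values can be illustrated with a short calculation. The sketch below assumes a normal approximation (critical value 1.96) and works only from the rounded figures printed in the text, so it approximates rather than reproduces the authors' regression output.

```python
import math

def wald_from_ci(estimate, ci_low, ci_high, z_crit=1.96):
    """Recover the standard error, z statistic and two-sided p value
    from an estimate and its 95% CI (normal approximation assumed)."""
    se = (ci_high - ci_low) / (2.0 * z_crit)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return se, z, p

# Values as printed in the text (grams).
for cohort, (est, lo, hi) in {"Moore": (122, -43, 286),
                              "UCL-FGS": (68, -196, 332)}.items():
    se, z, p = wald_from_ci(est, lo, hi)
    print(f"{cohort}: SE={se:.1f} g, z={z:.2f}, p={p:.2f}")
```

The wide standard errors relative to the effect sizes show why neither small cohort reaches significance on its own.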

To address this, we then introduced a third and larger

collection, the ALSPAC cohort (n = 8,785) from the Avon

Longitudinal Study of Parents and Children.13

Because this cohort only includes samples from the

mother and child and because PHLDA2 is a maternally expressed

transcript, the RS1 effect group (n = 179) consisted

of homozygous RS1/RS1 babies and heterozygous babies

with homozygous RS1/RS1 mothers (Table S3). Using the


same analysis as previously described, we found that

maternal inheritance of RS1 in this cohort results in an

average 88 g increase in the baby’s birth weight (p = 0.03; 95% CI = 6–170) (Table 1). We then performed

a meta-analysis to combine the data from all three cohorts

by using both fixed- and random-effects models. The

results from both models showed that babies inheriting

maternal RS1 have a 93 g heavier birth weight (p = 0.01;

95% CI = 22–163) (Figure 3). No evidence of heterogeneity

was found across the studies (p = 0.92, I² = 0%). In addition,

no evidence of association was found between birth

weight and paternally inherited RS1, consistent with the

imprinting of PHLDA2. Medical records and clinical data

for all three cohorts were obtained with informed consent,

and the study was approved by the ALSPAC Law and Ethics

Committee and the local research ethics committees of

Hammersmith and Queen Charlotte’s and Chelsea

Hospital Trust and University College London.
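The pooled estimate reported for the three cohorts can be approximated from the per-study values in Table 1 with a standard inverse-variance fixed-effect calculation. This is a sketch under stated assumptions: standard errors are back-calculated from the rounded 95% CIs with a critical value of 1.96, and Cochran's Q with I² is used for heterogeneity. The article does not name the software or estimator the authors used.

```python
import math

# Effect estimate and 95% CI in grams, as printed in Table 1.
STUDIES = {
    "Moore": (122, -43, 286),
    "UCL-FGS": (68, -196, 332),
    "ALSPAC": (88, 6, 170),
}

def fixed_effect_meta(studies, z_crit=1.96):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I^2."""
    estimates, weights = [], []
    for est, lo, hi in studies.values():
        se = (hi - lo) / (2.0 * z_crit)  # SE recovered from the CI
        estimates.append(est)
        weights.append(1.0 / se ** 2)
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    se_pooled = math.sqrt(1.0 / total)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Closed-form chi-square tail is only valid for 2 degrees of freedom.
    p_het = math.exp(-q / 2.0) if df == 2 else None
    return {
        "pooled": pooled,
        "ci_low": pooled - z_crit * se_pooled,
        "ci_high": pooled + z_crit * se_pooled,
        "p": math.erfc(abs(pooled / se_pooled) / math.sqrt(2.0)),
        "q": q,
        "i2": i2,
        "p_het": p_het,
    }

result = fixed_effect_meta(STUDIES)
print({k: round(v, 3) for k, v in result.items()})
```

Because Q is smaller than its degrees of freedom here, a moment-based between-study variance (as in the DerSimonian-Laird estimator) would be truncated at zero, which is consistent with the fixed- and random-effects models giving the same result.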

The meta-analysis indicated that the fetal genotypes had

a direct influence on the babies’ birth weight; therefore, we

Figure 2. The Effects of RS1 and RS2 on PHLDA2 Promoter Efficiency in HEK 293T Cells

The bars indicate the firefly luciferase expression relative to Renilla luciferase activity in HEK 293T cells for the 300 bp and 600 bp constructs with either RS2 or RS1. Luciferase activity was measured 30 hr after transfection. The data show the mean of six replicate samples ± SEM (standard error of the mean). Asterisks represent p < 0.05.

Table 1. Baby Genotypes and Influence on Birth Weight: Individual Studies and Meta-Analysis

Study        | RS2/RS2 | RS1/RS1 | RS2/RS1 | P (a) | M (b) | RS1 Effect Group | Effect Estimate (g) | 95% CI (g) | p value
Moore        | 193     | 4       | 66      | 22    | 24    | 28               | 122                 | −43–286    | 0.15
UCL-FGS      | 292     | 5       | 88      | 20    | 11    | 16               | 68                  | −196–332   | 0.61
ALSPAC       | 6,649   | 128     | 2,008   | 465   | 51    | 179              | 88                  | 6–170      | 0.03*
Combined (c) | 7,134   | 137     | 2,162   | 507   | 86    | 223              | 93                  | 22–163     | 0.01*

The RS1 effect group consists of babies with maternally inherited RS1. RS1/RS2 heterozygous babies with heterozygous parents are uninformative for the parental origin of RS1 and were therefore removed from the analysis. All effect estimates (g) have been adjusted for the following covariates: gender, parity, maternal weight, gestational age, maternal smoking, diabetes, and hypertension. The observed genotype frequency had no evidence of deviation from the Hardy-Weinberg equilibrium. Asterisks represent p < 0.05. Three further alleles with different numbers of repeats were identified at the PHLDA2 RS locus in an extremely small number of individuals (n = 25) from the ALSPAC cohort and were thus excluded from the statistical analysis.

(a) The number of heterozygous babies with paternally inherited RS1.

(b) The number of heterozygous babies with maternally inherited RS1.

(c) The meta-analysis of all three cohorts.

further investigated the ALSPAC cohort

to see whether the maternal genotypes

would have an effect on the babies’ birth weight. Because we cannot determine the

parental origin of the RS1 allele of the

heterozygous mothers without the grandparents’

samples, we instead compared the

effect of three maternal genotype groups

on the babies’ birth weight by using the

homozygous RS2/RS2 group (n = 465) as

the baseline. To test this, we used a linear-

regression model corrected for the same covariates

described in the previous analysis. Our analysis showed

that the heterozygous group (n = 529) had little impact

on the babies’ birth weight (+0.3 g; p = 0.99; 95% CI = −69–70); this result was expected because half of the

babies should inherit a paternal RS1. However, when

the mothers were homozygous for RS1 (n = 61), the babies

were found to be 155 g heavier (p = 0.04; 95% CI = 9–300),

indicating that maternal genotypes have an additional

influence on fetal growth potentially through the intra-

uterine environment. This change is of similar magnitude

to the reduction caused by maternal smoking (Table S2).

Notably, the effect of heterozygous mothers on birth

weight was not midway between those of the two homozygous groups, even

though half would be expected to carry the maternal

RS1. Instead, the homozygous RS1/RS1 group had considerably

more than twice the effect on birth weight of

the heterozygous group. This suggests a three-generation

cumulative effect, given that a homozygous RS1/RS1

mother also inherits maternal RS1 from her mother.


Alternatively, homozygous RS1/RS1 mothers could also

affect babies’ birth weight by influencing the circulating

PHLDA2 protein/mRNA levels in the maternal blood,

given that PHLDA2 shows biallelic expression in adult

blood.14

Maternal inheritance of RS1 did not affect the placental

weight (+2.5 g; p = 0.93; 95% CI = −62–67) but did

have a small and statistically significant influence on

head circumference (+0.23 cm; p = 0.04; 95% CI = 0.01–0.45).

Interestingly, although the RS1 sequence is conserved

in monkeys, the duplicated RS2 allele seems to be

exclusive to humans (Figure S1). This implies an evolu-

tionary role in human reproductive success. Consistent

with the conflict hypothesis, maternal PHLDA2 RS1 is associated

with increases in both birth weight and head

circumference. Conversely, the net effect of the common

(RS2/RS2) genotype in humans is limited birth weight and

head circumference, an effect that might provide an

evolutionary advantage by protecting the mother and her

birth canal.

Given the perinatal and life-long health complications

associated with very low birth weight,1,2,15 a number of

studies have investigated the genetic contribution of puta-

tive growth-regulating genes, including the imprinted

genes IGF2 (MIM 147470) and H19 (MIM 103280).16–19

Genome-wide linkage or association studies have located

several loci associated with birth weight,20–23 although

none have yet been directly associated with actual gene

function. PHLDA2 has not previously been detected in

these screens, perhaps as a result of the complexity intro-

duced by the parent-of-origin effect but also because

PHLDA2 RS1 is not in linkage disequilibrium with nearby

SNPs and is therefore not well-represented on the genotyp-

ing platforms used. In addition, our study maximizes the

information content for this allele because we specifically

genotyped all informative individuals for what is essen-

tially the functional and presumably causal variant. The

biochemical function of PHLDA2 remains unknown. It is

a small cytoplasmic protein that binds to phosphoinosi-

tide lipids via its PH domain.24 A recent study showed a

relationship between PHLDA2 expression and lower

growth velocity of the fetal femur; this relationship

suggests that PHLDA2 possibly plays a role in bone devel-

opment.25 Although increased Phlda2 expression in

transgenic mice resulted in a smaller placenta and a corre-

sponding reduction in birth weight,5 we could not

replicate this finding on the human placenta either in

a comparative study with PHLDA2 expression11 or indi-

rectly via association with the promoter RS genotype.

Nevertheless, a profound effect on the babies’ birth weight

was still detected, suggesting that placental weight was not

the predominant regulatory factor. It also appears that

this effect is controlled through the maternal expression

of the gene, which is consistent with the conflict hypoth-

esis and is mediated by the maternal genetic inheritance at

the DNA level in the promoter. This provides the first

example of a maternal genetic effect working together

with a maternally driven epigenetic effect. We suspect it

will be the first of many examples once further details of

the interactions of the genome with the epigenome are

unraveled. The PHLDA2 promoter RS and its expression

might serve as a useful genetic biomarker that can be

used to predict birth size. Further insight into the function

of PHLDA2 along with other imprinted genes will help us

understand the genetic basis of fetal growth as well as

the common and serious complications—such as IUGR—

of pregnancy.

Supplemental Data

Supplemental Data include three figures and three tables and can

be found with this article online at http://www.cell.com/AJHG.

Acknowledgments

We would like to thank M. Sweeney for her help at the sequencing

facility at the Institute of Neurology and all the members of

Professor Moore’s Development and Growth research group for

valuable suggestions and help. This research was funded by the

Child Health Research Appeal Trust (the Institute of Child Health

and the Great Ormond Street Hospital for Children [GOSH]), Overseas

Research Studentship (M.I.), the Medical Research Council,

Wellbeing of Women, March of Dimes, PARKS, and the GOSH

Charity. P.S. is supported by the GOSH Charity. We are extremely

grateful to all the families who took part in this study, the midwives

for their help in recruiting the families, and the whole ALSPAC

(Avon Longitudinal Study of Parents and Children) team, which

includes interviewers, computer and laboratory technicians, cler-

ical workers, research scientists, volunteers, managers, reception-

ists, and nurses. The UK Medical Research Council, the Wellcome

Trust, and the University of Bristol provide core support for

ALSPAC. We would also like to thank the University College Lon-

don Fetal Growth Study cohort team for their collaborative work.

Received: December 21, 2011

Revised: February 17, 2012

Accepted: February 22, 2012

Published online: March 22, 2012

Figure 3. Meta-Analysis Showing the Relationship between Birth Weight and PHLDA2 Promoter RS1 Effect

The data are depicted in a forest plot; the 95% CI for each study is represented by a horizontal line, and the estimated effect sizes are shown as gray squares. The weight of the study in the meta-analysis is represented by the size of the squares. The scale used is in grams (g). The diamond shape indicates the mean and 95% CI for the total estimate of the effect. Both random- and fixed-effect models produced the same results, and the plot represents the results from the fixed-effect model.


Web Resources

The URLs for data presented herein are as follows:

ALSPAC, http://www.bristol.ac.uk/alspac

Online Mendelian Inheritance in Man (OMIM), http://www.

omim.org

TFSEARCH, http://www.cbrc.jp/research/db/TFSEARCH

UCSC Genome Browser, http://genome.ucsc.edu

References

1. Barker, D.J. (2004). The developmental origins of adult disease.

J. Am. Coll. Nutr. 23 (6, Suppl), 588S–595S.

2. McIntire, D.D., Bloom, S.L., Casey, B.M., and Leveno, K.J.

(1999). Birth weight in relation to morbidity and mortality

among newborn infants. N. Engl. J. Med. 340, 1234–1238.

3. Moore, T., and Haig, D. (1991). Genomic imprinting in

mammalian development: A parental tug-of-war. Trends

Genet. 7, 45–49.

4. Frank, D., Fortino, W., Clark, L., Musalo, R., Wang, W., Saxena,

A., Li, C.M., Reik, W., Ludwig, T., and Tycko, B. (2002).

Placental overgrowth in mice lacking the imprinted gene Ipl.

Proc. Natl. Acad. Sci. USA 99, 7490–7495.

5. Tunster, S.J., Tycko, B., and John, R.M. (2010). The imprinted

Phlda2 gene regulates extraembryonic energy stores. Mol.

Cell. Biol. 30, 295–306.

6. Saxena, A., Frank, D., Panichkul, P., Van den Veyver, I.B.,

Tycko, B., and Thaker, H. (2003). The product of the imprinted

gene IPL marks human villous cytotrophoblast and is lost in

complete hydatidiform mole. Placenta 24, 835–842.

7. Qian, N., Frank, D., O’Keefe, D., Dao, D., Zhao, L., Yuan, L.,

Wang, Q., Keating, M., Walsh, C., and Tycko, B. (1997). The

IPL gene on chromosome 11p15.5 is imprinted in humans

and mice and is similar to TDAG51, implicated in Fas expres-

sion and apoptosis. Hum. Mol. Genet. 6, 2021–2029.

8. McMinn, J., Wei, M., Schupf, N., Cusmai, J., Johnson, E.B.,

Smith, A.C., Weksberg, R., Thaker, H.M., and Tycko, B.

(2006). Unbalanced placental expression of imprinted

genes in human intrauterine growth restriction. Placenta 27,

540–549.

9. Diplas, A.I., Lambertini, L., Lee, M.-J., Sperling, R., Lee, Y.L.,

Wetmur, J., and Chen, J. (2009). Differential expression of

imprinted genes in normal and IUGR human placentas.

Epigenetics 4, 235–240.

10. Kumar, N., Leverence, J., Bick, D., and Sampath, V. (2012).

Ontogeny of growth-regulating genes in the placenta.

Placenta 33, 94–99.

11. Apostolidou, S., Abu-Amero, S., O’Donoghue, K., Frost, J.,

Olafsdottir, O., Chavele, K.M., Whittaker, J.C., Loughna, P.,

Stanier, P., and Moore, G.E. (2007). Elevated placental expres-

sion of the imprinted PHLDA2 gene is associated with low

birth weight. J. Mol. Med. 85, 379–387.

12. Hindmarsh, P.C., Geary, M.P., Rodeck, C.H., Kingdom, J.C.,

and Cole, T.J. (2002). Intrauterine growth and its relationship

to size and shape at birth. Pediatr. Res. 52, 263–268.

13. Jones, R.W., Ring, S., Tyfield, L., Hamvas, R., Simmons, H.,

Pembrey, M., and Golding, J.; ALSPAC Study Team. (2000).

A new human genetic resource: A DNA bank established as

part of the Avon longitudinal study of pregnancy and child-

hood (ALSPAC). Eur. J. Hum. Genet. 8, 653–660.

14. Muller, S., van den Boom, D., Zirkel, D., Koster, H., Berthold,

F., Schwab, M., Westphal, M., and Zumkeller, W. (2000).

Retention of imprinting of the human apoptosis-related

gene TSSC3 in human brain tumors. Hum. Mol. Genet. 9,

757–763.

15. Simmons, R.A. (2009). Developmental origins of adult disease.

Pediatr. Clin. North Am. 56, 449–466.

16. Adkins, R.M., Somes, G., Morrison, J.C., Hill, J.B., Watson,

E.M., Magann, E.F., and Krushkal, J. (2010). Association of

birth weight with polymorphisms in the IGF2, H19, and

IGF2R genes. Pediatr. Res. 68, 429–434.

17. Gomes, M.V., Soares, M.R., Pasqualim-Neto, A., Marcondes,

C.R., Lobo, R.B., and Ramos, E.S. (2005). Association between

birth weight, body mass index and IGF2/ApaI polymorphism.

Growth Horm. IGF Res. 15, 360–362.

18. Petry, C.J., Ong, K.K., Barratt, B.J., Wingate, D., Cordell, H.J.,

Ring, S.M., Pembrey, M.E., Reik, W., Todd, J.A., and Dunger,

D.B.; ALSPAC Study Team. (2005). Common polymorphism

in H19 associated with birthweight and cord blood IGF-II

levels in humans. BMC Genet. 6, 22.

19. Petry, C.J., Seear, R.V., Wingate, D.L., Acerini, C.L., Ong, K.K.,

Hughes, I.A., and Dunger, D.B. (2011). Maternally transmitted

foetal H19 variants and associations with birth weight. Hum.

Genet. 130, 663–670.

20. Andersson, E.A., Pilgaard, K., Pisinger, C., Harder, M.N.,

Grarup, N., Faerch, K., Poulsen, P., Witte, D.R., Jørgensen, T.,

Vaag, A., et al. (2010). Type 2 diabetes risk alleles near

ADCY5, CDKAL1 and HHEX-IDE are associated with reduced

birthweight. Diabetologia 53, 1908–1916.

21. Arya, R., Demerath, E., Jenkinson, C.P., Goring, H.H., Puppala,

S., Farook, V., Fowler, S., Schneider, J., Granato, R., Resendez,

R.G., et al. (2006). A quantitative trait locus (QTL) on chromo-

some 6q influences birth weight in two independent family

studies. Hum. Mol. Genet. 15, 1569–1579.

22. Fradin, D., Heath, S., Lepercq, J., Lathrop, M., and Bougneres,

P. (2006). Identification of distinct quantitative trait Loci

affecting length or weight variability at birth in humans. J.

Clin. Endocrinol. Metab. 91, 4164–4170.

23. Freathy, R.M., Mook-Kanamori, D.O., Sovio, U., Prokopenko,

I., Timpson, N.J., Berry, D.J., Warrington, N.M., Widen, E.,

Hottenga, J.J., Kaakinen, M., et al; Genetic Investigation of

ANthropometric Traits (GIANT) Consortium; Meta-Analyses

of Glucose and Insulin-related traits Consortium; Wellcome

Trust Case Control Consortium; Early Growth Genetics

(EGG) Consortium. (2010). Variants in ADCY5 and near

CCNL1 are associated with fetal growth and birth weight.

Nat. Genet. 42, 430–435.

24. Saxena, A., Morozov, P., Frank, D., Musalo, R., Lemmon, M.A.,

Skolnik, E.Y., and Tycko, B. (2002). Phosphoinositide binding

by the pleckstrin homology domains of Ipl and Tih1. J. Biol.

Chem. 277, 49935–49944.

25. Lewis, R.M., Cleal, J.K., Ntani, G., Crozier, S.R., Mahon, P.A.,

Robinson, S.M., Harvey, N.C., Cooper, C., Inskip, H.M., God-

frey, K.M., et al; Southampton Women’s Survey Study Group.

(2012). Relationship between placental expression of the

imprinted PHLDA2 gene, intrauterine skeletal growth and

childhood bone mass. Bone 50, 337–342.


REPORT

Alzheimer Disease Susceptibility Loci: Evidence for a Protein Network under Natural Selection

Towfique Raj,1,2,3,4 Joshua M. Shulman,1,3,4 Brendan T. Keenan,1,4 Lori B. Chibnik,1,3,4

Denis A. Evans,5,6 David A. Bennett,5,6 Barbara E. Stranger,2,3,4 and Philip L. De Jager1,3,4,*

Recent genome-wide association studies have identified a number of susceptibility loci for Alzheimer disease (AD). To understand the

functional consequences and potential interactions of the associated loci, we explored large-scale data sets interrogating the human

genome for evidence of positive natural selection. Our findings provide significant evidence for signatures of recent positive selection

acting on several haplotypes carrying AD susceptibility alleles; interestingly, the genes found in these selected haplotypes can be assem-

bled, independently, into a molecular complex via a protein-protein interaction (PPI) network approach. These results suggest a possible

coevolution of genes encoding physically-interacting proteins that underlie AD susceptibility and are coexpressed in different tissues. In

particular, PICALM, BIN1, CD2AP, and EPHA1 are interconnected through multiple interacting proteins and appear to have coordinated

evidence of selection in the same human population, suggesting that they may be involved in the execution of a shared molecular function.

This observation may be AD-specific, as the 12 loci associated with Parkinson disease do not demonstrate excess evidence of natural

selection. The context for selection is probably unrelated to AD itself; it is likely that these genes interact in another context, such as in

immune cells, where we observe cis-regulatory effects at several of the selected AD loci.

Alzheimer disease (AD [MIM 104300]) is the most common

neurodegenerative disease and is a leading cause of

dementia.1 Sporadic late-onset AD has a genetic compo-

nent that includes the well-known ε4 haplotype of APOE

(MIM 107741) and other loci that harbor susceptibility

alleles.2–6 However, the functional consequences and

evolutionary history of these loci and their possible

interactions remain largely unknown. Previous studies

reporting evidence of selection at the APOE locus7,8 sug-

gested the hypothesis that AD-associated pathways may

have experienced selection in human populations, quite

possibly due to selective pressure on a phenotype un-

related to AD, given that AD has little impact on an

individual’s reproductive fitness. To further explore this

hypothesis, we integrated evidence for natural selection

within validated AD loci with a pathway-based analysis

of these loci and an examination of transcriptional

patterns in immune cells. We identified several genes in

AD susceptibility loci that appear to physically interact

and that show evidence of having undergone natural

selection. Additionally, several of these loci exhibit a cis-

regulatory effect on the transcription levels of the selected

genes in immune cells, suggesting one plausible mecha-

nism by which an AD-related molecular pathway may

have evolved in response to environmental pressures in

early human history.

Given reports that the APOE ε4 haplotype exhibits

evidence for natural selection,7,8 we assessed all validated

and well-replicated AD susceptibility loci3,4 for evidence

of recent (<60,000 years ago) positive selection, using

linkage disequilibrium (LD)-based methods to detect

genomic regions harboring genetic variants inferred to

have recently and rapidly increased in frequency within

human populations. We applied the integrated haplotype

score (iHS)9 statistic, which measures the lengths of

the haplotypes around a given SNP, to identify evidence

of positive selection in human populations of African

(Yoruba from Ibadan, Nigeria [YRI]), European (Centre

d’Etude du Polymorphisme Humain samples from Utah

residents of European descent [CEU]), and East Asian

(Japanese subjects from Tokyo and Han Chinese sub-

jects from Beijing representing Asian populations [ASI])

ancestry from phase II of the International HapMap

Project.10 The iHS statistic was implemented with a mean

of 0 and a variance of 1. To discover signatures of selec-

tion on the haplotype bearing AD susceptibility alleles,

we searched for SNPs meeting a stringent threshold of

|iHS| > 2 (corresponding to the most extreme 5% of iHS

values across the genome among HapMap II SNPs with

minor allele frequency > 0.05) within the LD block con-

taining the index SNP of each locus from two large-scale

AD genome-wide association studies (GWAS).3,4 Furthermore,

we restricted our analysis to SNPs in LD (r² > 0.5

and/or D′ = 1) with the published index SNP associated

with AD in each locus. Although we tested only 11 loci,

we applied a stringent correction for genome-wide testing

to identify only the most robust signals of selection. Using

this approach, we found evidence for selection acting on

haplotypes carrying AD-associated alleles in 3 of 11 loci,

including PICALM (MIM 603025; rs561655), BIN1 (MIM

1Program in Translational NeuroPsychiatric Genomics, Institute for the Neurosciences Department of Neurology, Brigham and Women’s Hospital, 77 Avenue Louis Pasteur, Boston, MA 02115, USA; 2Division of Genetics, Department of Medicine, Brigham and Women’s Hospital, 77 Avenue Louis Pasteur, Boston, MA 02115, USA; 3Harvard Medical School, Boston, MA 02115, USA; 4Program in Medical and Population Genetics, The Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA; 5Departments of Internal Medicine and Neurological Science, Rush University Medical Center, 600 S Paulina Street, Chicago, IL 60612, USA; 6Rush Alzheimer’s Disease Center, Rush University Medical Center, 600 S Paulina Street, Chicago, IL 60612, USA

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.022. ©2012 by The American Society of Human Genetics. All rights reserved.

720 The American Journal of Human Genetics 90, 720–726, April 6, 2012

601248; rs7561528), and CD2AP (MIM 604241;

rs9349407) (Table 1). All of these loci showed significant

(with the use of a threshold for genome-wide testing)

evidence for selection in the same HapMap population of

East Asian descent, suggesting that the loci may have

responded to the same selective pressure. Using a less

stringent “suggestive” threshold of significance (|iHS| >

1.65, corresponding to the top 10% of iHS values across

the genome), we also saw evidence of natural selection at

MS4A2 (MIM 147138; rs610932) in the African population

(Table 1). Except for the MS4A2 locus, the index SNP from

the reported AD GWAS was not the SNP exhibiting the

strongest evidence of selection. This is not surprising given

that these index SNPs emerged from genome-wide screens

and are most likely surrogate markers for the causal

variant(s) in each locus. The evidence for selection may,

in fact, help to pinpoint variants that are more likely to

have a functional effect involved in driving the selection

process and perhaps AD susceptibility.
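The two thresholds used above can be applied mechanically to the tag-SNP scores listed in Table 1. The sketch below does that and, as a rough sanity check, also computes the two-sided tail fraction that each threshold would cut off if standardized iHS values followed a standard normal distribution. That distributional assumption is made here for illustration only; the article defines the 5% and 10% figures empirically across HapMap II SNPs.

```python
import math

# iHS values for the tag SNPs, copied from Table 1 as (ASI, CEU, YRI).
TAG_SNP_IHS = {
    "PICALM": (2.17, 1.31, -1.88),
    "BIN1": (2.58, 1.64, 1.45),
    "CD2AP": (2.06, -0.49, -1.08),
    "MS4A2": (-0.88, 0.73, -1.92),
}
POPULATIONS = ("ASI", "CEU", "YRI")

def two_sided_tail(threshold):
    """P(|Z| > threshold) for a standard normal Z (illustrative assumption)."""
    return math.erfc(threshold / math.sqrt(2.0))

def selection_call(ihs, significant=2.0, suggestive=1.65):
    """Label a standardized iHS value using the thresholds from the text."""
    if abs(ihs) > significant:
        return "significant"
    if abs(ihs) > suggestive:
        return "suggestive"
    return "none"

for locus, scores in TAG_SNP_IHS.items():
    calls = {pop: selection_call(s) for pop, s in zip(POPULATIONS, scores)}
    print(locus, calls)

print(f"normal tail beyond 2.00: {two_sided_tail(2.0):.3f}")
print(f"normal tail beyond 1.65: {two_sided_tail(1.65):.3f}")
```

Under the normal approximation the two cutoffs correspond to roughly 4.6% and 9.9% of values, close to the empirical 5% and 10% quoted in the text.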

Interestingly, each positively selected allele on the

AD-associated haplotypes in the PICALM, BIN1, and

CD2AP loci had a high positive iHS score in the East Asian

subjects (Table 1). A positive iHS means that the haplo-

types with the ancestral allele that humans share with

chimpanzees are longer than those containing the derived

allele; therefore, these results suggest that selection favored

the ancestral allele at all three loci. Although a neighboring

derived allele that is targeted by a selective force could

drive selection at one ancestral allele, this is unlikely to

be the case in all three loci. It is more likely that the three

ancestral alleles are themselves the targets of selection. In

all three cases, the selected ancestral alleles are those asso-

ciated with diminished susceptibility to AD; the derived

alleles are the risk alleles. This observation introduces the

possibility that these three loci worked together in a shared

pathway and were affected in a coordinated manner over

the course of human evolution in the East Asian popula-

tion that encountered a specific selective pressure. Further-

more, the three risk-associated alleles have all been selected

against, suggesting that they may have converging effects

on the same cellular function, one implicated in AD

susceptibility but most likely important in other contexts

as well.

To further confirm the robustness of the selection

signals, we deployed alternative methods of selecting

SNP sets with which to detect evidence for natural

selection in AD loci and validate the results of the haplo-

type-based iHS analysis. In a region-based analysis, we

determined the proportion of SNPs with |iHS| > 2 in

a 50-SNP window centered on the index SNP. In a gene-

based approach, we created a window of 50 SNPs centered

on each gene closest to the index SNP; the genes in the

upper 10% of the empirical distribution for number of

significant SNPs were then considered to be candidate

targets of selection (Table 2). Table 2 and Table S1 summa-

rize the results of these analyses; they are consistent with

the results of the haplotype-based analysis. In these ana-

lyses, we subsequently explored the possibility that APOE

or genes associated with early-onset, Mendelian forms

of AD (APP [MIM 104760], PSEN1 [MIM 104311], and

Table 1. Alzheimer Disease Susceptibility Loci with Evidence of Positive Selection

Integrated haplotype score (iHS) is given for the ASI, CEU, and YRI populations.

Chr   | Index SNP  | Locus      | Tag SNP (a) | ASI   | CEU   | YRI

Loci with Evidence of Selection
11q14 | rs561655   | PICALM (b) | rs659023    | 2.17  | 1.31  | −1.88
2q14  | rs7561528  | BIN1 (b)   | rs10200967  | 2.58  | 1.64  | 1.45
6p12  | rs9349407  | CD2AP (b)  | rs9395288   | 2.06  | −0.49 | −1.08
11q12 | rs610932   | MS4A2 (b)  | rs610932    | −0.88 | 0.73  | −1.92

Other Loci
7q35  | rs11767557 | EPHA1 (b)  | –           | −0.74 | 0.27  | 0.59
1q32  | rs6701713  | CR1        | –           | −0.11 | 0.26  | −0.71
19q13 | rs3865444  | CD33       | –           | 0.43  | −0.87 | –
19q13 | rs2075650  | APOE       | –           | 0.55  | 1.5   | –
19q13 | rs4420638  | APOE       | –           | 0.45  | 0.7   | −0.16
8p21  | rs1532278  | CLU        | –           | –     | –     | –
19p13 | rs3764650  | ABCA7      | –           | –     | –     | –

ASI, East Asian; YRI, African; CEU, European.

(a) Tag SNPs: SNPs that best capture the selection signal on each AD susceptibility haplotype. Tag SNPs were chosen on the basis of their LD (r² > 0.5; D′ = 1) with the published index SNP for each locus, evidence for selection signals, and AD GWAS association at p < 10⁻⁶. Absolute iHS > 2 and > 1.65 correspond to the most extreme 5% and 10% of iHS values across the genome, respectively.

(b) Shows evidence for selection in gene-based analyses (Table S2).

The American Journal of Human Genetics 90, 720–726, April 6, 2012 721

PSEN2 [MIM 600759]) harbor evidence of natural selection; however, none of these loci returned |iHS| scores that met our suggestive or significant thresholds (data not shown).
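The region-based window analysis described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the handling of SNPs without a score, and the way the window is shifted near the ends of the SNP list are all assumptions made for the example.

```python
def window_proportion(ihs_scores, index_pos, window=50, threshold=2.0):
    """Fraction of scored SNPs with |iHS| > threshold in a window around index_pos.

    ihs_scores: iHS values ordered by genomic position (None = no score available).
    index_pos:  position of the index SNP within that list.
    """
    half = window // 2
    start = max(0, index_pos - half)
    end = min(len(ihs_scores), start + window)
    start = max(0, end - window)  # shift the window left near the end of the list
    scored = [s for s in ihs_scores[start:end] if s is not None]
    if not scored:
        return 0.0
    extreme = sum(1 for s in scored if abs(s) > threshold)
    return extreme / len(scored)


def upper_tail_cutoff(values, fraction=0.10):
    """Smallest value that still falls in the upper `fraction` of `values`."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[k - 1]


# Hypothetical scores: a flat background with a few extreme values near the index SNP.
scores = [0.1] * 100
scores[48:53] = [2.5, -2.1, 3.0, 1.0, -2.6]
proportion = window_proportion(scores, index_pos=50)
print(proportion)
```

A region would then be flagged as a candidate when its proportion is at or above `upper_tail_cutoff` computed over the empirical distribution of all windows.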

To test for enrichment of positive selection among the AD-associated loci, we compared the proportion of haplotypes with positively selected loci in the set of AD-validated loci with a similar list of Parkinson disease (PD [MIM 168600])-validated loci (n = 12).11 In this targeted analysis, we imposed our suggestive |iHS| > 1.65 threshold and found that the proportion of AD-associated loci meeting this threshold of evidence for selection in at least one HapMap population (45%, 5 of 11 SNPs have |iHS| > 1.65; Table 1) was higher than the proportion of PD loci under selection (8%, 1 of 12 SNPs have |iHS| > 1.65; χ² [AD versus PD] = 4.774, p [AD versus PD] = 0.029). These results suggest enrichment for loci with evidence of selection among validated AD susceptibility loci. We obtained similar results when comparing the AD loci to random sets of SNPs with similar allele frequencies (data not shown).

On the basis of the hypothesis that evidence for natural selection might be a feature of other, as-yet undiscovered AD susceptibility loci, we extended these analyses to the list of loci that met a suggestive threshold of significance (p < 10⁻⁴) for association with AD susceptibility in a recent GWAS that provided a comprehensive list of such results.3 Out of an initial 447 suggestive SNPs in this study, we found 118 loci with independent effects (after LD pruning of SNPs with an r² = 0.5 threshold). We found that 10 (8%) of these 118 loci have an |iHS| > 2 and therefore meet our threshold of genome-wide significance. Furthermore, 21 loci (18%) demonstrate suggestive evidence (|iHS| > 1.65) of natural selection, which is more than expected by chance (expected: 11.8 [10%]; p = 0.005). Thus, our observation in the validated AD loci may be a more general feature of AD susceptibility loci.
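The comparison of 21 observed loci against an expectation of 11.8 can be illustrated with an exact one-sided binomial tail. The text does not state which test produced the reported p value, so the binomial model below is an assumption chosen for illustration, and the result should be read as a rough check rather than a reproduction.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly over the upper tail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_loci = 118        # LD-pruned suggestive loci (from the text)
n_suggestive = 21   # loci with |iHS| > 1.65 (from the text)
null_rate = 0.10    # |iHS| > 1.65 marks the most extreme 10% genome-wide

expected = n_loci * null_rate
p_value = binomial_upper_tail(n_suggestive, n_loci, null_rate)
print(f"expected = {expected:.1f}, observed = {n_suggestive}, one-sided p = {p_value:.4f}")
```

The null rate of 10% follows from the definition of the suggestive threshold; treating the 118 loci as independent draws is the simplification that LD pruning is meant to justify.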

Given that 3 of 11 AD-associated loci show evidence of

selection in the same human population, it is plausible

that some of the genes they contain may have a correlated

evolutionary history and are coevolving. Coevolution can

occur when a heritable change in one gene establishes

selective pressure for another gene.12 To detect such coevo-

lutionary processes, we constructed protein-protein inter-

action (PPI) networks to identify interacting proteins,

because two proteins are more likely to share correlated

evolutionary history if they physically interact.13 This

method of constructing networks does not leverage infor-

mation regarding natural selection and is therefore inde-

pendent of our earlier analyses. To construct a PPI network,

we selected genes found within validated and sug-

gested AD-associated loci (defined as the genomic segment

bounded by SNPs with an r² > 0.5 to the index SNP for

a given locus) as input for the web-based tool Disease

Association Protein-Protein Link Evaluator (DAPPLE),

which uses high-confidence pairwise protein interactions

and tissue-specific expression data to reconstruct a PPI

network.14 The network is conservative, requiring that

interacting proteins be known to be coexpressed in a given

tissue. The resultant AD PPI network returned by DAPPLE

is statistically significant for a network connectivity that

allows a common interactor protein (not known to be asso-

ciated with AD) between AD genes when compared to

1,000 random networks (permuted p = 0.043; Figure 1). This analysis is not significant (p = 0.367) when direct

network connectivity between the AD genes is required.

We were able to indirectly connect all but two of the

proteins coded in known AD loci (ABCA7 [MIM 605414]

and CR1 [MIM 120620]).
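The idea of a permuted p value for indirect connectivity can be sketched on a toy network. This is not DAPPLE's procedure, whose permutation scheme is not reproduced here; the sketch simply draws random gene sets of the same size and asks how often they are at least as connected through common interactors as the seed genes. The graph, gene names, and helper functions are all hypothetical.

```python
import random
from itertools import combinations


def add_edge(graph, a, b):
    """Record an undirected interaction between proteins a and b."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def indirect_links(graph, genes):
    """Count gene pairs that share at least one interactor outside the gene set."""
    gene_set = set(genes)
    count = 0
    for a, b in combinations(sorted(gene_set), 2):
        shared = (graph.get(a, set()) & graph.get(b, set())) - gene_set
        if shared:
            count += 1
    return count


def permuted_p(graph, seed_genes, n_perm=1000, seed=0):
    """Empirical p: share of random gene sets at least as connected as the seeds."""
    rng = random.Random(seed)
    observed = indirect_links(graph, seed_genes)
    nodes = sorted(graph)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(nodes, len(seed_genes))
        if indirect_links(graph, sample) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy network: four seed genes share one common interactor ("HUB");
# the remaining proteins interact only in isolated pairs.
toy = {}
for gene in ("G1", "G2", "G3", "G4"):
    add_edge(toy, gene, "HUB")
for i in range(0, 30, 2):
    add_edge(toy, f"N{i}", f"N{i + 1}")

observed, p_value = permuted_p(toy, ["G1", "G2", "G3", "G4"])
print(observed, p_value)
```

Requiring direct edges between seed genes instead of a shared interactor would correspond to the stricter test that the text reports as not significant.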

When overlaying our AD PPI network allowing for

a common interactor protein with the results of our

analyses for evidence of natural selection, an intrigu-

ing subnetwork that includes the proteins encoded by

PICALM, BIN1, CD2AP, and EPHA1 (MIM 179610)

emerged. Integrating our gene-based PPI results with our

gene-based selection analysis (Table S1), we saw that all

four susceptibility loci showed evidence of positive

Table 2. Alzheimer-Disease-Associated Haplotypes, Regions, and Genes with Evidence of Positive Selection

Locus SNP Gene

Three Different Approaches to Assess for Evidence of Positive Selection

Haplotype Region Gene

ASI CEU YRI ASI CEU YRI ASI CEU YRI

6p12 rs9349407 CD2AP + + +

11q12 rs610932 MS4A2 + + +

11q14 rs561655 PICALM + + + + +

2q14 rs7561528 BIN1 + + +

7q35 rs11767557 EPHA1 +

Evidence of selection (indicated by "+") based on the analyses from alternate methods of selecting SNP sets with which to evaluate evidence for selection. The "haplotype" analysis is our primary analysis. In the haplotype analysis, we searched for SNPs in LD (r² > 0.5; D′ = 1) with the index SNP in each locus (Table 1). We then considered haplotype blocks containing a linked SNP with |iHS| > 2 to be candidate targets of selection. We deployed secondary analyses to illustrate the robustness of our results. For our secondary regional analysis, we determined the number of SNPs with |iHS| > 2 in a 50-SNP window centered on the GWAS SNP. We then considered the regions in the upper 10% of the empirical distribution for proportion of SNPs with |iHS| > 2 to be candidate targets of selection. For gene-based analysis, we determined the proportion of SNPs with |iHS| > 2 in a 50-SNP window centered on the gene closest to the GWAS SNP. We then considered the genes in the upper 10% of the empirical distribution for number of significant SNPs to be candidate targets of selection. The symbol "+" indicates evidence for selection at each locus in three HapMap populations.


selection in East Asians (Table 2 and Table S1). This

evidence suggests a possible coevolution of these four

genes, which, on the basis of the PPI analysis, may interact

in a macromolecular complex (Figures 1 and 2). To assess

the statistical significance of this subnetwork, we con-

structed a new PPI network by using these four genes as

the seed regions, and found that the subnetwork was statistically significant for indirect connectivity (permuted p = 0.033; Figure 2). We observed that many of the interacting

proteins connecting the AD-associated proteins also

showed significant evidence for selection (Figure 2), and,

interestingly, one of these proteins is encoded by a gene

(GAB2 [MIM 606203]) that has been previously suggested

to be associated with AD.15,16 Thus, these AD-associated

and ‘‘connector’’ genes may have been under selection

because of their participation in a single functional module

over evolutionary time, and common function may also

explain their individual associations with AD suscepti-

bility. Given the late age-at-onset of AD, it is unlikely

that this phenotype itself has been the target of selec-

tion over evolutionary time; rather, it is likely that these

four genes, along with their interaction partners, com-

pose a functional module that is important in other biolog-

ical contexts.

Some of the strongest selective pressures acting during

human evolution have been attributed to human interac-

tions with pathogens; we thus investigated the functional

consequences of AD-associated variants with evidence of

selection on gene expression levels in immune cells. This

approach was additionally motivated by the observation

Figure 1. Protein-Protein Interaction Network Generated from Proteins Encoding for AD-Associated and -Suggested Genes
The 118 LD-pruned SNPs (p < 10⁻⁴) from the Alzheimer Disease Genetics Consortium (ADGC) GWAS,3 which provided a complete list of suggestive results, were included as an input to DAPPLE. DAPPLE selects genes from a SNP list based on the region containing SNPs with an r² > 0.5 to each index SNP; this region is then extended to the nearest recombination hot spot. AD-associated proteins are represented as nodes connected by an edge if there is in vitro evidence for high-confidence interaction. The large colored circles represent AD-associated and -suggestive proteins, and small circles in gray represent the connected proteins (not known to be associated with AD). Most of the AD-associated proteins are connected via common interactor proteins (gray) with which the associated proteins each share an edge. The red points indicate genes (CD2AP, PICALM, BIN1, EPHA1, and MS4A2) under positive selection in our gene-based analyses of HapMap II populations (Table S1). The PPI network is statistically significant for indirect connectivity (PPI network permuted p = 0.043).

that some AD-associated variants map close

to genes with putative immune function

(i.e., CD33 [MIM 159590], CR1, and

MS4A2) and others, such as PICALM, BIN1, CD2AP, and

EPHA1, are coexpressed in immune cells, particularly in

the myeloid lineage.17 Thus, we assessed each of the 11

validated and well-replicated AD loci for evidence of an

effect on RNA expression in cis; that is, we performed an

expression quantitative trait locus (eQTL) analysis in

each locus for genes in the vicinity of the index SNP (see

Supplemental Material and Methods). We first investigated

genetic variation and mRNA expression in an available

data set derived from peripheral blood mononuclear cells

(PBMCs) of 228 individuals of European ancestry with

demyelinating disease,18 representing a set of subjects

with an activated immune system. PBMCs are purified

from peripheral blood and contain both myeloid and

lymphoid cells. After gene-based permutation assessing

significance of SNP-gene association p values, we identified

five cis-regulatory effects in the 11 tested loci (Table S2).

Specifically, we found three AD-associated index variants (rs610932, rs7561528, and rs3752246) with cis-regulatory effects in PBMCs: rs610932 influences the expression of neighboring MS4A2 (nominal p = 5.04 × 10⁻⁸), whereas rs7561528 and rs3752246 affect the expression of BIN1 (p = 4.76 × 10⁻⁴) and ABCA7 (p = 6.87 × 10⁻⁵), respectively. These cis-regulatory effects are all significant at a p < 0.05 permutation threshold; the same MS4A2 and BIN1 haplotypes harbor evidence of natural selection (Table 1).
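A cis-eQTL test of the kind described above amounts to asking whether expression of a nearby gene changes with the number of copies of an allele. The sketch below regresses expression on allele dosage for a single SNP and obtains an empirical p value by shuffling expression values. It is a single-SNP simplification under stated assumptions: the study's gene-based permutation procedure is described in its supplement and is not reproduced here, and the dosage and expression values are invented for illustration.

```python
import random


def slope_and_r(dosages, expression):
    """Least-squares slope and Pearson r of expression on allele dosage (0, 1, 2)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    # Assumes both dosage and expression vary across samples (sxx, syy > 0).
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def permutation_p(dosages, expression, n_perm=2000, seed=0):
    """Empirical two-sided p value from shuffling expression across individuals."""
    rng = random.Random(seed)
    _, r_obs = slope_and_r(dosages, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, r_perm = slope_and_r(dosages, shuffled)
        if abs(r_perm) >= abs(r_obs):
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Invented data: expression falls as the number of risk alleles rises.
dosages = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 1, 2]
expression = [10.1, 9.8, 10.3, 9.0, 9.2, 8.7, 8.1, 7.9, 8.2, 10.0, 9.1, 8.0]

slope, r = slope_and_r(dosages, expression)
p_value = permutation_p(dosages, expression)
print(f"slope = {slope:.2f}, r = {r:.2f}, permutation p = {p_value:.4f}")
```

A negative slope here corresponds to the direction reported for PICALM, where the risk allele is associated with decreased expression.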

In the PICALM locus, which has strong evidence of selection, the current best AD marker does not have a strong cis-regulatory effect on gene expression, but other SNPs that


have evidence of association with AD (p < 5 × 10⁻⁸ in published studies)3,4 and are in strong LD with the index SNP have replicated evidence of having a cis-regulatory effect on gene expression. Specifically, the AD risk allele rs659023G (pAD = 2.78 × 10⁻¹⁰)3 (r² = 0.8 with index PICALM SNP rs561655) is significantly associated with decreased expression of PICALM in PBMC (p = 4.76 × 10⁻⁴; Figure S1). We replicate this cis-regulatory effect in CD4+ T lymphocytes (which are constituents of the PBMC cell mixture) of 40 healthy individuals of European ancestry (p = 2.48 × 10⁻⁴).

In the case of PICALM, the evidence for selection and the

effect on gene expression converge (Figure 3; Table S3): the

rs659023 SNP discussed above exhibits a strong correlation

with PICALM RNA expression and has a high iHS (eQTL p = 4.76 × 10⁻⁴, |iHS| = 2.17). These results suggest that

one or more potential functional variants that influence

PICALM expression may have been selected for over the

course of human history in this well-validated AD locus.

This information can be leveraged to focus fine-mapping

efforts onto a short list of candidate causal variants that

can be targeted in well-powered AD susceptibility analyses.
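Building a short list of candidate variants from the two lines of evidence can be sketched as a simple intersection: keep the SNPs that have both an extreme |iHS| and a small cis-eQTL p value. Only the rs659023 values come from the text; the other records, the field names, and the eQTL cutoff of 10⁻³ are hypothetical choices for illustration.

```python
def colocalized_snps(snps, ihs_cutoff=2.0, eqtl_p_cutoff=1e-3):
    """Return IDs of SNPs with both an extreme |iHS| and a small cis-eQTL p value."""
    selected = []
    for snp in snps:
        if snp["ihs"] is None:
            continue  # no selection score available for this SNP
        if abs(snp["ihs"]) > ihs_cutoff and snp["eqtl_p"] < eqtl_p_cutoff:
            selected.append(snp["id"])
    return selected


snps = [
    {"id": "rs659023", "ihs": 2.17, "eqtl_p": 4.76e-4},   # values reported in the text
    {"id": "snp_A", "ihs": 2.40, "eqtl_p": 0.21},         # hypothetical: selection only
    {"id": "snp_B", "ihs": 0.35, "eqtl_p": 2.0e-5},       # hypothetical: eQTL only
    {"id": "snp_C", "ihs": None, "eqtl_p": 1.0e-6},       # hypothetical: no iHS score
]

shortlist = colocalized_snps(snps)
print(shortlist)
```

In practice the cutoffs would be set from the genome-wide iHS distribution and the permutation-based eQTL threshold rather than fixed constants.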

In summary, we observe significant evidence for signa-

tures of positive selection on haplotypes associated with

late-onset Alzheimer disease. The most intriguing finding

of our study is that multiple AD-associated genes (i.e.,

PICALM, BIN1, CD2AP, and EPHA1) with evidence for posi-

tive selection encode proteins that physically interact

within an independently defined PPI network. Further-

more, some of the linking proteins in the network also

Figure 2. Protein-Protein Interaction Subnetwork of AD-Associated Genes under Positive Selection
The subnetwork is simply a highly interconnected subset of the larger network that emerges from DAPPLE and is illustrated in Figure 1. The subnetwork is statistically significant for indirect connectivity (PPI subnetwork permuted p = 0.033). As in Figure 1, the red dots mark genes with evidence for natural selection.

show evidence of positive selection, and

one of these, GAB2, has previously been suggested to be associated with AD.15,16 The convergence of results in these two complementary analyses suggests that the four interacting proteins encoded by these AD susceptibility genes may have had a correlated evolutionary history, perhaps in response to a single evolutionary pressure that affected East Asian populations and, to a lesser degree, European populations (Table 2). Moreover, we found that several of these loci exhibit a cis-regulatory effect on the transcription of the selected genes in immune cells, suggesting that changes in gene expression may be one mechanism by which an AD-related molecular pathway evolved in early human history, most likely in a non-AD context.

The results presented here provide convincing support for recent positive selection of loci underlying late-onset AD, but what are the selective pressures underlying the signatures? It is difficult to identify the precise selective pressure that led to our observations. Selection may result from an array of forces including pathogen resistance as well as environmental and dietary changes as modern humans migrated out of Africa and spread throughout the world. For the genes with putative immune functions, we speculate

that pathogens may have exerted selective pressures on

populations and consequently influenced the frequency

of AD-associated alleles. Regardless of what the selection

pressure might have been, our results suggest that several

different loci implicated in AD susceptibility may have

worked together in another context to enhance survival

over the course of recent human evolutionary history.

Evidence for natural selection at a given variant suggests

that such a variant may have functional consequences,

given that the selection process was mediated by an alteration of a biological process. One can think of selective pressures as natural, in vivo human experiments in which we can measure the response of human populations to unknown perturbations, and these alterations can inform the function of genes within a given locus. However, by itself, evidence of

selection is not sufficient for identifying causal variants in

susceptibility loci. Rather, it offers another key dimension

of functional information to integrative analyses that

include disease association, protein-protein interaction,


and gene expression. Such analyses will powerfully

explore the molecular mechanisms underlying associations

between genetic variation and disease susceptibility. In the

case of AD, we have highlighted a network of coexpressed,

physically interacting susceptibility genes that is supported

by evidence of selection; this observation lays the ground-

work for future hypothesis-driven investigations into the

function of the interactions of these susceptibility genes,

which may be related to vesicular trafficking, given the

known functions of PICALM, BIN1, CD2AP, and EPHA1.

On the basis of our expression data, characterization of cell

populations from peripheral blood may provide a reasonable

substrate for functional investigations that seek to interro-

gate the coordinated functional consequences of these four

susceptibility loci, and perhaps others.

Supplemental Data

Supplemental Data include one figure and three tables and can be

found with this article online at http://www.cell.com/AJHG/.

Acknowledgments

This work is supported by the National Institutes of Health (NIH)

(RC2 GM093080, R01 AG30146, R01 AG179917, R01 AG15819,

K08 AG034290, P30 AG10161 and R01 AG11101). J.M.S. was additionally supported by the Burroughs Wellcome Fund. We thank the Brigham and Women’s Hospital PhenoGenetic Project for providing mRNA samples from healthy subjects that were used in the CD4+ T lymphocyte transcriptional analysis for this study. We thank Michelle Lee for sample collection, Katherine Rothamel for data generation, and Scott Davis for mRNA expression quality control. We

thank Christophe Benoist for his leadership on RC2 GM093080.

These analyses were conducted under the auspices of a protocol

approved by the institutional review board of Partners Healthcare.

Received: October 18, 2011

Revised: February 3, 2012

Accepted: February 22, 2012

Published online: April 5, 2012

Web Resources

The URLs for data presented herein are as follows:

DAPPLE, http://www.broadinstitute.org/mpg/dapple/dapple.php

dbSNP, http://www.ncbi.nlm.nih.gov/projects/SNP/

Haplotter, http://haplotter.uchicago.edu/

HapMap FTP site, ftp://ftp.ncbi.nlm.nih.gov/hapmap/

iHS software, http://hgdp.uchicago.edu/Software/

Online Mendelian Inheritance in Man (OMIM), http://www.omim.org

Figure 3. Colocalization of cis-Regulatory Effects and Positive Selection Signals in the PICALM Locus
The top panel reports the cis-regulatory effects of each SNP in the vicinity of the PICALM locus on PICALM RNA expression in PBMC; -log(p value) is reported on the y axis. The lower panel reports the evidence for positive selection at each SNP over the same chromosomal segment (1 Mb total); here we have inverted the y axis so that the most extreme iHS values are at the bottom of the scale. The RefSeq genes in the region are shown at the bottom of the figure. The LD (in r²) for each SNP with the index AD-associated PICALM SNP (rs561655) is illustrated with the use of colors, as indicated in the top right of the figure. The haplotype carrying GWAS index SNP rs561655 also contains other alleles that have the strongest selection signals and cis eQTL effects, suggesting that functional variants influencing the expression of PICALM may have been the target of recent natural selection. The SNPs with extreme iHS values are the same SNPs that have extreme p values in the cis eQTL analysis (Table S3).


References

1. Avramopoulos, D. (2009). Genetics of Alzheimer’s disease:

recent advances. Genome Med 1, 34.

2. Harold, D., Abraham, R., Hollingworth, P., Sims, R., Gerrish,

A., Hamshere, M.L., Pahwa, J.S., Moskvina, V., Dowzell, K.,

Williams, A., et al. (2009). Genome-wide association study

identifies variants at CLU and PICALM associated with

Alzheimer’s disease. Nat. Genet. 41, 1088–1093.

3. Naj, A.C., Jun, G., Beecham, G.W., Wang, L.S., Vardarajan,

B.N., Buros, J., Gallins, P.J., Buxbaum, J.D., Jarvik, G.P., Crane,

P.K., et al. (2011). Common variants at MS4A4/MS4A6E,

CD2AP, CD33 and EPHA1 are associated with late-onset

Alzheimer’s disease. Nat. Genet. 43, 436–441.

4. Hollingworth, P., Harold, D., Sims, R., Gerrish, A., Lambert,

J.C., Carrasquillo, M.M., Abraham, R., Hamshere, M.L.,

Pahwa, J.S., Moskvina, V., et al. (2011). Common variants at

ABCA7, MS4A6A/MS4A4E, EPHA1, CD33 and CD2AP are

associated with Alzheimer’s disease. Nat. Genet. 43, 429–435.

5. Lambert, J.C., Heath, S., Even, G., Campion, D., Sleegers, K.,

Hiltunen, M., Combarros, O., Zelenika, D., Bullido, M.J.,

Tavernier, B., et al; European Alzheimer’s Disease Initiative

Investigators. (2009). Genome-wide association study iden-

tifies variants at CLU and CR1 associated with Alzheimer’s

disease. Nat. Genet. 41, 1094–1099.

6. Seshadri, S., Fitzpatrick, A.L., Ikram, M.A., DeStefano, A.L.,

Gudnason, V., Boada, M., Bis, J.C., Smith, A.V., Carassquillo,

M.M., Lambert, J.C., et al; CHARGE Consortium; GERAD1

Consortium; EADI1 Consortium. (2010). Genome-wide anal-

ysis of genetic loci associated with Alzheimer disease. JAMA

303, 1832–1840.

7. Drenos, F., and Kirkwood, T.B. (2010). Selection on alleles

affecting human longevity and late-life disease: the example

of apolipoprotein E. PLoS ONE 5, e10022.

8. Vamathevan, J.J., Hasan, S., Emes, R.D., Amrine-Madsen, H.,

Rajagopalan, D., Topp, S.D., Kumar, V., Word, M., Simmons,

M.D., Foord, S.M., et al. (2008). The role of positive selection

in determining the molecular cause of species differences in

disease. BMC Evol. Biol. 8, 273.

9. Voight, B.F., Kudaravalli, S., Wen, X., and Pritchard, J.K.

(2006). A map of recent positive selection in the human

genome. PLoS Biol. 4, e72.

10. Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve,

L.L., Gibbs, R.A., Belmont, J.W., Boudreau, A., Hardenbol, P.,

Leal, S.M., et al; International HapMap Consortium. (2007).

A second generation human haplotype map of over 3.1

million SNPs. Nature 449, 851–861.

11. Nalls, M.A., Plagnol, V., Hernandez, D.G., Sharma, M.,

Sheerin, U.M., Saad, M., Simon-Sanchez, J., Schulte, C.,

Lesage, S., Sveinbjornsdottir, S., et al; International Parkinson

Disease Genomics Consortium. (2011). Imputation of

sequence variants for identification of genetic risks for Parkin-

son’s disease: a meta-analysis of genome-wide association

studies. Lancet 377, 641–649.

12. Fraser, H.B., Hirsh, A.E., Wall, D.P., and Eisen, M.B. (2004).

Coevolution of gene expression among interacting proteins.

Proc. Natl. Acad. Sci. USA 101, 9033–9038.

13. Tillier, E.R., and Charlebois, R.L. (2009). The human protein

coevolution network. Genome Res. 19, 1861–1871.

14. Rossin, E.J., Lage, K., Raychaudhuri, S., Xavier, R.J., Tatar, D.,

Benita, Y., Cotsapas, C., and Daly, M.J.; International Inflammatory Bowel Disease Genetics Consortium. (2011). Proteins

encoded in genomic regions associated with immune-

mediated disease physically interact and suggest underlying

biology. PLoS Genet. 7, e1001273.

15. Reiman, E.M., Webster, J.A., Myers, A.J., Hardy, J., Dunckley, T.,

Zismann, V.L., Joshipura, K.D., Pearson, J.V., Hu-Lince, D.,

Huentelman, M.J., et al. (2007). GAB2 alleles modify

Alzheimer’s risk in APOE epsilon4 carriers. Neuron 54, 713–720.

16. Ikram, M.A., Liu, F., Oostra, B.A., Hofman, A., van Duijn,

C.M., and Breteler, M.M. (2009). The GAB2 gene and the

risk of Alzheimer’s disease: replication and meta-analysis.

Biol. Psychiatry 65, 995–999.

17. Wu, C., Orozco, C., Boyer, J., Leglise, M., Goodale, J., Batalov,

S., Hodge, C.L., Haase, J., Janes, J., Huss, J.W., 3rd, and Su, A.I.

(2009). BioGPS: an extensible and customizable portal for

querying and organizing gene annotation resources. Genome

Biol. 10, R130.

18. De Jager, P., Jia, X., Wang, J., de Bakker, P., Ottoboni, L., Aggar-

wal, N., Piccio, L., Raychaudhuri, S., Tran, D., Aubin, C., et al.

(2009). Meta-analysis of genome scans and replication

identify CD6, IRF8 and TNFRSF1A as new multiple sclerosis

susceptibility loci. Nat. Genet. 41, 776–782.


REPORT

Linkage-Disequilibrium-Based Binning Affects the Interpretation of GWASs

Andrea Christoforou,1,2,16,* Michael Dondrup,3,16 Morten Mattingsdal,4,5,16 Manuel Mattheisen,6,7,8,9,16

Sudheer Giddaluru,1,2 Markus M. Nothen,6,7,10 Marcella Rietschel,11 Sven Cichon,1,6,7,12

Srdjan Djurovic,4,13,14 Ole A. Andreassen,4,14 Inge Jonassen,3,15 Vidar M. Steen,1,2 Pal Puntervoll,3

and Stephanie Le Hellard1,2

Genome-wide association studies (GWASs) are critically dependent on detailed knowledge of the pattern of linkage disequilibrium (LD)

in the human genome. GWASs generate lists of variants, usually SNPs, ranked according to the significance of their association to a trait.

Downstream analyses generally focus on the gene or genes that are physically closest to these SNPs and ignore their LD profile with other

SNPs. We have developed a flexible R package (LDsnpR) that efficiently assigns SNPs to genes on the basis of both their physical position

and their pairwise LD with other SNPs. We used the positional-binning and LD-based-binning approaches to investigate whether

including these ‘‘LD-based’’ SNPs would affect the interpretation of three published GWASs on bipolar affective disorder (BP) and of

the imputed versions of two of these GWASs. We show how including LD can be important for interpreting and comparing GWASs.

In the published, unimputed GWASs, LD-based binning effectively ‘‘recovered’’ 6.1%–8.3% of Ensembl-defined genes. It altered the

ranks of the genes and resulted in nonnegligible differences between the lists of the top 2,000 genes emerging from the two binning

approaches. It also improved the overall gene-based concordance between independent BP studies. In the imputed datasets, although

the increases in coverage (>0.4%) and rank changes were more modest, even greater concordance between the studies was observed,

attesting to the potential of LD-based binning on imputed data as well. Thus, ignoring LD can result in the misinterpretation of the

GWAS findings and have an impact on subsequent genetic and functional studies.

Over the past decade, genome-wide association studies

(GWASs) have revolutionized the analysis of human

complex genetic traits. By scanning hundreds of thou-

sands of genetic variants, typically SNPs, in hundreds or

thousands of individuals, they search for the variant(s)

that associate with a particular disease or trait. Critical to

the development and evolution of GWASs has been the

creation of the International HapMap Project,1 which

has cataloged the common patterns of human genetic vari-

ation, including the linkage disequilibrium (LD) between

SNPs. Knowledge of this LD, or nonrandom association

of alleles at multiple loci, has made it possible to identify

informative subsets of SNPs (i.e., ‘‘tagging SNPs’’) that

capture the bulk of genome-wide variation and has re-

sulted in affordable genome-wide genotyping. To date,

almost 1,000 GWASs have been published and have tested

hundreds of human traits and reported thousands of

significant associations (Catalog of Published Genome-

Wide Association Studies2). Previously known associations

have been confirmed, and new candidates have been

implicated.3 However, a general sense of disappointment

lingers because GWASs have fallen short of the initial

expectation that they would unravel the genetic basis of

complex traits.4,5 Recent analyses reveal that a large

proportion of the ‘‘missing heritability’’5,6 can be ex-

plained by a polygenic model that considers all GWAS

SNPs simultaneously,7–9 but these studies provide no clues

about the identity of the susceptibility variants or the

underlying biology of the trait.6 Thus, much attention

has been given to uncovering and characterizing this

‘‘missing’’ or ‘‘hidden’’ heritability.6,10

In a conventional GWAS, each SNP is considered sepa-

rately (the ‘‘single-marker’’ approach), resulting in a list

of variants ranked according to the statistical significance

of their association to the trait (i.e., their p value).11 The

‘‘top hits’’ are typically reported, and the relevance of

each finding, as well as the focus of future work, is

primarily based on the functional unit(s), namely gene(s),

implicated by the associated SNP. Furthermore, gene-based

methods are increasingly being applied as complementary

approaches to the analysis of GWAS data. These methods

take the gene instead of the individual SNP as the basic

unit of association and thus allow aggregation of SNPs of

smaller effect, potentially increasing power and reducing

1Dr. Einar Martens Research Group for Biological Psychiatry, Department of Clinical Medicine, University of Bergen, 5021 Bergen, Norway; 2Center for

Medical Genetics and Molecular Medicine, Haukeland University Hospital, 5021 Bergen, Norway; 3Computational Biology Unit, Uni Computing, Uni

Research, 5008 Bergen, Norway; 4Institute of Clinical Medicine, University of Oslo, 0318 Oslo, Norway; 5Research Unit, Sørlandet Hospital HF, 4604

Kristiansand, Norway; 6Department of Genomics, Life and Brain Center, University of Bonn, 53127 Bonn, Germany; 7Institute of Human Genetics,

University of Bonn, 53127 Bonn, Germany; 8Institute for Genomic Mathematics, University of Bonn, 53127 Bonn, Germany; 9Department of Biostatistics,

Harvard School of Public Health, Boston, MA 02115, USA; 10German Centre for Neurodegenerative Disorders, 53175 Bonn, Germany; 11Department of

Genetic Epidemiology in Psychiatry, Central Institute of Mental Health, University of Mannheim, 68159 Mannheim, Germany; 12Structural and

Functional Organization of the Brain, Institute of Neuroscience and Medicine, Research Center Julich, 52425 Julich, Germany; 13Department of

Medical Genetics, Oslo University Hospital, 0424 Oslo, Norway; 14Division of Mental Health and Addiction, Oslo University Hospital, 0424 Oslo, Norway; 15Department of Informatics, University of Bergen, 5008 Bergen, Norway
16These authors contributed equally to this work

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.025. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 727–733, April 6, 2012 727

the multiple-testing burden.12–14 They enable the incorpo-

ration of biological knowledge for greater insight into the

mechanisms underlying the trait and are essential for

subsequent pathway-based approaches.13 Gene-based

methods also facilitate direct comparison of independent

studies because they are unaffected by allelic heterogeneity

and potential differences in SNP coverage and LD

patterns.15

The success of both single-marker and gene-based

approaches is critically dependent on the correct assign-

ment of SNPs to genes. At the single-marker level, the

aim is to identify the gene(s) that the associated SNP is

tagging. At the gene level, the aim is to attribute all SNPs

tagging a particular gene to that gene. Although LD can

span hundreds of kilobases,16,17 when GWAS results

emerge, the SNPs of interest are typically assigned to the

nearest gene or transcript within a specified distance.14

In turn, genes are typically represented only by the SNPs

that are physically located within the transcribed region

or predefined flanking region.13 It is not systematically

taken into consideration that an associated SNP might be

in high LD with another SNP (genotyped or not) located

hundreds of kilobases away in a different gene or that

a genotyped SNP positioned outside the defined bound-

aries of a gene is tagging that gene. Here, we show that

ignoring LD discards valuable information and potentially

leads to the incorrect localization of the association signal

and might mislead the interpretation of GWAS data.

We have therefore developed a flexible R package

(LDsnpR) that systematically assigns SNPs to genes (or rele-

vant predefined genome ‘‘bins’’) by using SNP association

results (e.g., p values), bin definitions, and precalculated

pairwise LD data (e.g., r2 values) provided by the user

(Figure S1, available online). By default, LDsnpR assigns

a SNP to a bin if that SNP is located within the physical

boundaries of that bin (i.e., the ‘‘positional-binning’’

approach). Then, as a unique feature of this package, the

user has the option of also assigning a genotyped SNP to

a bin if that SNP is in high pairwise LD with another SNP

(genotyped or not) located within the physical boundaries

of that bin (i.e., ‘‘LD-based-binning’’ approach). Although

a genotyped SNP cannot be assigned to a particular gene

more than once, it can be assigned to more than one gene.
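The two assignment rules described above can be sketched in a few lines of Python. This is only an illustration of the logic, not LDsnpR's actual interface: the function name `bin_snps`, its arguments, and the input layouts (a position table, a set of genotyped SNPs, bin coordinates, and a list of pairwise r2 values) are all assumptions made for the example.

```python
from collections import defaultdict


def bin_snps(positions, genotyped, bins, ld_pairs, r2_min=0.8, flank=0):
    """Assign genotyped SNPs to bins positionally and via pairwise LD.

    positions: {snp_id: (chrom, bp)} for every SNP with a known location,
               genotyped or not (hypothetical layout).
    genotyped: set of SNP ids that carry association results.
    bins:      {bin_name: (chrom, start, end)}.
    ld_pairs:  iterable of (snp_a, snp_b, r2) from precalculated LD data.
    """
    # Index high-LD partners in both directions.
    partners = defaultdict(set)
    for a, b, r2 in ld_pairs:
        if r2 >= r2_min:
            partners[a].add(b)
            partners[b].add(a)

    positional, ld_based = {}, {}
    for name, (chrom, start, end) in bins.items():
        # Every SNP (genotyped or not) physically inside the extended bin.
        located = {
            snp for snp, (c, bp) in positions.items()
            if c == chrom and start - flank <= bp <= end + flank
        }
        pos_set = located & genotyped
        # Add genotyped SNPs in high LD with any SNP located in the bin.
        ld_set = set(pos_set)
        for snp in located:
            ld_set |= partners.get(snp, set()) & genotyped
        positional[name] = pos_set
        ld_based[name] = ld_set
    return positional, ld_based
```

Because each bin holds a set, a SNP is never counted twice for the same gene, yet the same SNP can appear in the sets of several genes, which matches the behavior described in the text.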

As proof of principle, we used LDsnpR to assess the

impact of the LD-based-binning approach (versus the posi-

tional-binning approach) on the results of three published

GWASs on bipolar disorder (BP), each unimputed and gen-

otyped on a different platform. The three GWASs are (1)

the UK-based Wellcome Trust Case Control Consortium

(WTCCC) BP GWAS,18 (2) the Norwegian Thematically

Organized Psychosis (TOP) BP GWAS,19 and (3) a German

BP GWAS20 (Table 1). Each GWAS had been previously

Table 1. Study Descriptions and Summary of Coverage for Positional-Binning and LD-Based-Binning Approaches for Original, Unimputed Datasets

| Study | WTCCC (a) | TOP (b) | German (c) |
|---|---|---|---|
| Sample size (cases/controls) | 1,868/2,938 | 198/336 | 682/1,300 |
| Platform used | Affymetrix 500K | Affymetrix 6.0 | Illumina HumanHap550v3 |
| Number of post-QC SNPs for binning | 468,648 | 615,396 | 511,978 |

| Binning data | WTCCC: positional binning | WTCCC: LD-based binning | WTCCC: difference (d) | TOP: positional binning | TOP: LD-based binning | TOP: difference (d) | German: positional binning | German: LD-based binning | German: difference (d) |
|---|---|---|---|---|---|---|---|---|---|
| Number of genes covered (e) | 30,610 (83.4%) | 33,443 (91.1%) | 2,833 (9.3%) | 31,823 (86.7%) | 33,905 (92.4%) | 2,082 (6.5%) | 31,708 (86.4%) | 33,861 (92.3%) | 2,153 (6.8%) |
| Number of post-QC SNPs binned | 237,869 (50.8%) | 277,534 (59.2%) | 39,665 (16.7%) | 307,949 (50.0%) | 363,570 (59.1%) | 55,621 (18.1%) | 272,914 (53.3%) | 308,634 (60.2%) | 35,720 (13.1%) |
| Number of SNPs binned to only one gene | 199,752 (84.0%) | 178,544 (64.3%) | 21,208 (10.6%) | 259,223 (84.2%) | 234,036 (64.4%) | 25,187 (9.7%) | 228,098 (83.6%) | 209,458 (67.9%) | 18,640 (8.2%) |
| Number of SNPs binned to ten or more genes | 135 (0.057%) | 2,537 (0.91%) | 2,402 | 174 (0.057%) | 3,106 (0.85%) | 2,932 | 141 (0.052%) | 2,072 (0.67%) | 1,931 |
| Mean number of SNPs per bin (median) | 9.4 (4) | 15.2 (10) | 6.6 (4) | 11.7 (5) | 19.4 (13) | 8.4 (6) | 10.5 (5) | 15.4 (10) | 5.6 (4) |
| Range (min–max) | 1–514 | 1–515 | 0–87 | 1–687 | 1–701 | 0–112 | 1–655 | 1–665 | 0–64 |
| Number of genes with only one SNP | 4,830 (15.8%) | 1,531 (4.6%) | 3,299 (68.3%) | 3,604 (11.3%) | 992 (2.9%) | 2,612 (72.5%) | 3,647 (11.5%) | 595 (1.8%) | 3,052 (83.7%) |

The following abbreviation is used: QC, quality control.
(a) The UK-based Wellcome Trust Case Control Consortium (WTCCC) BP GWAS.18
(b) The Norwegian Thematically Organized Psychosis (TOP) BP GWAS.19
(c) A German BP GWAS.20
(d) Percentages indicate percent increase or decrease from positional to LD-based binning.
(e) Ensembl 54 (May 2009) genes (total N = 36,693) tagged by at least one SNP.

approved by the relevant local research ethics committees,

and all participants had provided written informed

consent.18–20 In addition, we assessed the impact of LD-

based binning on imputed versions of the TOP and

German GWASs, in which ungenotyped markers had

been statistically inferred11 on the basis of LD from

different reference panels (i.e., HapMap Phase III for TOP;

HapMap Phase III and 1,000 Genomes21 for German)

(Table 2).

BP is a severe complex psychiatric disorder that shows

high heritability (60%–80%) but for which clear genetic

risk factors remain elusive.4 Although several GWASs on

BP have been performed (Catalog of Published Genome-

Wide Association Studies2), the findings have shown little

overlap at both the SNP and gene levels. Also, only a hand-

ful of SNPs have achieved genome-wide significance

(p < ~10⁻⁸), and these SNPs explain less than 3% of

the heritability,4,22 suggesting that psychiatric disorders,

such as BP, might be less amenable to GWASs than other

disorders.5,23 However, systematic LD-based gene binning

has not been applied to these datasets, possibly contrib-

uting to the apparent lack of success. Thus, we assessed

the effects of the LD-based-binning approach relative to

the traditional positional-binning approach with respect

to (1) gene coverage, (2) changes in the results and, poten-

tially, the interpretation of findings, and (3) pairwise

concordance of the findings among the BP GWASs.

In brief, for LDsnpR, gene bin definitions were based on

the Human Ensembl release 54 (May 2009) gene identifiers

with unambiguous positional information (N = 36,693).

We extended these gene bins by another 10 kb on either

side to best capture potential regulatory regions.24,25 The

LD data were based on HapMap Phase II release 27 and

were restricted to that of the CEU (Utah residents with

ancestry from northern and western Europe from the

CEPH collection) sample. We set the pairwise LD at the

widely accepted threshold of r2 ≥ 0.8 (ref. 26) to limit the loss

of power needed for the detection of association at the

linked locus.27

We first compared the extent of coverage between the

positional-binning and LD-based-binning approaches in

the published, unimputed datasets (Table 1). By allowing

us to identify the intergenic SNPs that tag genes, LD-based

binning resulted in a ~13%–18% increase in the number of

SNPs included in the gene-binning process. Intergenic

SNPs represent ~40% of GWAS trait-associated SNPs.3

Notably, LD-based binning ‘‘recovered’’ >2,000 genes

(>6%) in all three datasets, increasing the proportion of

Ensembl 54 genes tagged by at least one SNP from ~83%

to >91%. Furthermore, there was an increase in the density

of coverage; an average of 5.6 to 8.4 (median of four to six)

SNPs were added per gene, and there was an overall

decrease (>68%) in the number of genes tagged by only

one SNP.
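The coverage figures quoted above are simple ratios against the 36,693 Ensembl 54 gene bins. As a worked check, the following sketch recomputes the WTCCC values reported in Table 1; the function name and dictionary keys are illustrative choices and are not part of LDsnpR.

```python
TOTAL_GENES = 36_693  # Ensembl 54 gene bins, as stated in the text


def coverage_summary(positional, ld_based, total=TOTAL_GENES):
    """Summarize gene coverage for positional vs. LD-based binning."""
    gained = ld_based - positional
    return {
        # Share of all gene bins tagged by at least one SNP.
        "positional_pct": round(100 * positional / total, 1),
        "ld_based_pct": round(100 * ld_based / total, 1),
        # Genes "recovered" by LD-based binning.
        "gained": gained,
        # Increase relative to the positional-binning count.
        "relative_increase_pct": round(100 * gained / positional, 1),
    }


# WTCCC counts from Table 1: 30,610 genes (positional), 33,443 (LD-based).
wtccc = coverage_summary(30_610, 33_443)
print(wtccc)
```

Note that the "difference" percentages are relative to the positional-binning count, not to the total number of gene bins, which is why 2,833 recovered genes correspond to 9.3% rather than 7.7%.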

Table 2. Study Descriptions and Summary of Coverage for Positional-Binning and LD-Based-Binning Approaches for Imputed Datasets

| Study | TOP (a) Imputed (b) | German (c) Imputed (b) |
|---|---|---|
| Sample size (cases/controls) | 198/336 | 657/1,308 |
| Imputation reference panel | HapMap Phase III (CEU) | 1,000 Genomes (pilot 1, CEU) and HapMap Phase III (CEU) |
| Post-QC SNPs for binning | 992,161 | 4,825,148 |

| Binning data | TOP imputed: positional binning | TOP imputed: LD-based binning | TOP imputed: difference (d) | German imputed: positional binning | German imputed: LD-based binning | German imputed: difference (d) |
|---|---|---|---|---|---|---|
| Number of genes covered (e) | 33,242 (90.6%) | 34,193 (93.2%) | 951 (2.9%) | 32,116 (87.5%) | 32,259 (87.9%) | 143 (0.4%) |
| Number of post-QC SNPs binned | 521,720 (52.6%) | 612,316 (61.7%) | 90,596 (17.4%) | 2,394,441 (49.6%) | 2,613,493 (54.2%) | 219,052 (9.1%) |
| Number of SNPs binned to only one gene | 431,808 (43.5%) | 367,671 (37.1%) | 64,137 (14.9%) | 1,979,660 (41.0%) | 1,855,413 (38.5%) | 124,247 (6.3%) |
| Number of SNPs binned to ten or more genes | 267 (0.03%) | 7,967 (0.8%) | 7,700 | 1,272 (0.03%) | 16,807 (0.3%) | 15,535 |
| Mean number of SNPs per bin (median) | 19.3 (9) | 35.9 (25) | 17.1 (12) | 91.6 (44) | 130.6 (84) | 39.5 (26) |
| Range (min–max) | 1–1,046 | 1–1,062 | 0–214 | 1–5,570 | 1–5,573 | 0–573 |
| Number of genes with only one SNP | 1,795 (5.4%) | 651 (1.9%) | 1,144 (63.7%) | 241 (0.8%) | 208 (0.6%) | 33 (13.7%) |

The following abbreviation is used: QC, quality control.
(a) The Norwegian Thematically Organized Psychosis (TOP) BP GWAS.19
(b) Imputation details: the Norwegian TOP dataset was imputed according to the ENIGMA protocol with the use of MACH imputation software38 and HapMap Phase III (CEU) as the reference panel. The German dataset was imputed with IMPUTE2 software39 and the 1,000 Genomes Project (Pilot 1, CEU) and HapMap Phase III (CEU) as reference panels.
(c) A German BP GWAS.20
(d) Percentages indicate percent increase or decrease from positional to LD-based binning.
(e) Ensembl 54 (May 2009) genes (total N = 36,693) tagged by at least one SNP.

The imputed datasets also yielded increased coverage

(Table 2) but, as expected, to a lesser extent depending

on the reference panel used for imputation. Although

HapMap II (i.e., LDsnpR reference panel) is denser than

HapMap III28 (i.e., reference panel for the TOP and

German studies), imputation on the 1,000 Genomes data

(i.e., reference panel for the German study) potentially

gives the densest coverage. For the TOP and German

imputed datasets, LD-based binning resulted in an increase

of 17.4% and 9.1%, respectively, in the number of SNPs

included in the gene-binning process and the recovery of

951 (2.9%) and 143 (0.4%) genes, respectively. Although

this is only a small proportion of the total gene coverage,

the recovery of these genes enables them to be considered

as candidates for BP association and might lead to a better

understanding of the biology should the true association

stem from them. Also of note, in the German GWAS, LD-

based binning alone achieved an overall gene coverage of

92.3% (imputation achieved 87.5% coverage, and imputa-

tion combined with LD-based binning achieved 87.9%

coverage), suggesting that under some scenarios, LD-based

binning alone can offer the most coverage. As with the

original GWASs, there was an increase in the density of

coverage; an average of 17.1 and 39.5 (median 12 and

26) SNPs were added per gene for the TOP and German

imputed datasets, respectively. There was also a decrease

in the number of genes tagged by only one SNP in the TOP

imputed dataset (63.7%); the decrease was smaller for the

German imputed dataset (13.7%).

We next assessed the effects of the LD-based-binning

approach on the results of the three GWASs at both the

single-marker and gene levels. At the single-marker level,

we used the positional-binning and LD-based-binning

approaches to compare the genes tagged by the most

significant SNPs reported in the original publications18–20

(Table S1). Although LD-based binning made no difference

to the results of the TOP BP study, three of the 14 SNPs in

the WTCCC BP study and three of the eight SNPs in the

German BP study implicated additional or alternative

genes. Interpreting GWAS single-marker results demands

fastidious consideration because when given only the

p value, it is not immediately clear where the true source

of the association originates17 and thus which is the true

candidate gene. The overall potential for mislocalizing the

association signal was underscored by the reduced number

of SNPs tagging only one gene and the increased number of

SNPs tagging ten or more genes after LD-based binning

(Tables 1 and 2). Further investigations, such as expression

studies,20 are therefore warranted before attributing puta-

tive causality to a gene and, as a result, nominating it as

the focus of future fine-mapping, functional, and other

expensive and time-consuming follow-up studies.29

As previously stated, gene-based analyses are ideal for

pathway approaches, which aid in the interpretation of

GWAS results by exploiting prior biological annotation to

determine whether certain biological functions are en-

riched (i.e., overrepresented) among the more significant

genes in a dataset. These methods require one measure of

association (or score) for each gene on the basis of the indi-

vidual SNP association signals. Here, we used a function in

LDsnpR to score each gene with the most significant

p value (i.e., the minimum p value approach), which was

adjusted for the number of SNPs tagging that gene by

a modification of Sidak’s correction.30 The minimum

p value approach is the most widely used gene-scoring

approach31 and assumes an underlying genetic architec-

ture in which a single SNP, or locus, within the gene

contributes to the disorder. The modification performs at

least as well as a powerful regression-based method in cor-

recting for the bias due to SNP number.32 In this study, the

correlation between the gene score and the number

of SNPs in the bin was reduced from Pearson r2 > 0.30 to

r2 < 0.020 in all three datasets after the modified Sidak

correction was applied. Also, permutation-based gene-set

analysis, as implemented in PLINK,33 on the German

GWAS confirmed the high correlation between modified

Sidak-corrected p values and permutation-based p values

(r2 > 0.95). The genes were scored for both the posi-

tional-binning and LD-based-binning approaches and

were compared.
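The minimum p value scoring described above can be sketched as follows. The source does not spell out the formula for the modified Sidak correction; this sketch assumes the form usually attributed to the cited modification, in which the exponent n of the standard Sidak correction is replaced by (n + 1) / 2. The function names are illustrative and are not LDsnpR's.

```python
def sidak(p_min, n_snps):
    """Standard Sidak correction for the best of n independent tests."""
    return 1.0 - (1.0 - p_min) ** n_snps


def modified_sidak(p_min, n_snps):
    """Assumed modification: effective number of tests is (n + 1) / 2.

    The halving is a rough allowance for LD between SNPs in a gene;
    treat this exact form as an assumption, not the package's code.
    """
    return 1.0 - (1.0 - p_min) ** ((n_snps + 1) / 2)


def score_gene(p_values):
    """Score a gene by its smallest SNP p value, adjusted for SNP count."""
    if not p_values:
        raise ValueError("gene has no binned SNPs")
    return modified_sidak(min(p_values), len(p_values))


# A gene tagged by three SNPs; the best p value is 0.01.
print(score_gene([0.5, 0.01, 0.2]))
```

Under this assumed form, a gene tagged by a single SNP keeps its p value unchanged, and the penalty grows more slowly with SNP count than under the unmodified correction.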

The overall correlation in the ranks of the genes between

the two approaches was ≤0.83 in the three original datasets

and the TOP imputed dataset, indicating that LD-based

binning altered the scores and the subsequent ranks

of the genes. Although not as large, changes in rank were

also observed in the German imputed dataset (Table 3).

When a resampling analysis was performed on the unim-

puted WTCCC dataset (it randomly excluded 5% of the

samples [20 repetitions]), the average overall correlation

in ranks due to LD-based binning (0.80) was lower than

that resulting from random fluctuations in the datasets

(>0.87), indicating greater changes due to LD (Table S2).

Such changes in rank are likely to impact threshold-free,

rank-based pathway approaches, such as gene-set-enrich-

ment analysis,34 which aims to determine whether a prede-

fined set of genes is enriched at the top of a ranked list. By

inspecting the top 2,000 genes emerging from the two

binning approaches, we found a 27%–34% difference

between the two gene lists in the three unimputed and

the TOP imputed datasets and a 15.5% difference in the

German imputed dataset. Here, the resampling analysis

in the WTCCC GWAS found that random fluctuations in

Table 3. Effect of LD-Based Binning on Ranks of Genes within Each GWAS

| | WTCCC | TOP | German | TOP Imputed | German Imputed |
|---|---|---|---|---|---|
| Correlation (a) of gene ranks | 0.79 | 0.83 | 0.83 | 0.83 | 0.92 |
| Number of genes moving into top 2,000 with LD-based binning | 681 (34.0%) | 601 (30.0%) | 538 (26.9%) | 558 (27.9%) | 309 (15.5%) |

(a) Spearman rank correlation (i.e., rho).

the dataset led to a 25.6% change in the top 2,000 genes,

whereas LD-based binning resulted in a 30.7% difference

(Table S2). For threshold-based approaches, such as Inge-

nuity Pathway Analysis and ALIGATOR,35 in which a list

of genes meeting a specified threshold is tested for overrep-

resentation of a particular biological function, LD-based

binning could result in the submission of a substantially

different list. Changes in the ranks of the genes are thus

likely to impact the outcome of these analyses and possibly

the overall biological interpretation of the findings. The

extent to which these LD-based changes are meaningful

will also depend on the study design and resulting power,

given that the resampling analysis shows that substantial

changes in results can also occur as a result of slight

changes in the dataset.
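The two comparisons used in this analysis, the Spearman correlation of gene ranks and the turnover in the top-ranked gene list, can be sketched in plain Python. The function names, the dictionary inputs (gene identifier mapped to gene score), and the decision to correlate only genes scored under both approaches are assumptions made for this illustration.

```python
def average_ranks(values):
    """Rank values from smallest to largest, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def compare_gene_scores(pos_scores, ld_scores, top_k=2000):
    """Return (rho over shared genes, genes newly entering the top_k)."""
    shared = sorted(set(pos_scores) & set(ld_scores))
    rho = spearman([pos_scores[g] for g in shared],
                   [ld_scores[g] for g in shared])

    def top(scores):
        # Smaller score (p value) means a more significant gene.
        return set(sorted(scores, key=scores.get)[:top_k])

    newcomers = top(ld_scores) - top(pos_scores)
    return rho, len(newcomers)
```

In this sketch a gene that is covered only after LD-based binning can still enter the top list, so it counts as a newcomer even though it does not contribute to the rank correlation.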

Finally, we assessed whether LD-based binning

improved the concordance of results across studies, espe-

cially in light of the aforementioned changes in the ranks

of the genes. We compared the positional-binning and LD-

based-binning approaches by performing pairwise rank-

correlation analyses of the three GWAS datasets at both

the SNP level and the gene level (Table 4). When the posi-

tional-binning approach was used, little to no correlation

was observed at both the SNP and gene levels. However,

with LD-based binning, the overall rank correlation

increased by ~3% and was more significant for all pairwise

comparisons, including the imputed datasets. Interest-

ingly, the greatest concordance was observed when LD-

based binning was combined with imputation, high-

lighting the complementary nature of the two methods.

Although there was no obvious increase in overlap in the

top gene hits (data not shown), this increase in overall

concordance warrants the use of the LD-based-binning

approach for the reanalysis of these and other datasets in

the search for common functional gene sets and pathways.

The observed increase in correlation persisted even when

regions of high LD, such as the MHC (major histocompat-

ibility complex) region on chromosome 6, were excluded

(data not shown).

Our study illustrates the importance of systematically

accounting for LD in the interpretation of GWAS results.

To the best of our knowledge, our study is the first to quan-

tify the added value of LD-based binning; in particular, it

shows an increase in the concordance of results across

independent GWASs of a trait as complex as BP. Excluding

LD defies the basic premise of the GWAS approach by dis-

carding valuable genetic information and risking the

incorrect localization of the association signal and the

misinterpretation of the biology of the findings. Our find-

ings call for a reanalysis of previously published GWAS

data via the LD-based-binning approach and for future

GWASs to adopt this method automatically. LDsnpR facil-

itates this process by efficiently assigning SNPs to genes

and provides the option of scoring the genes for direct

entry into pathway-analysis tools. LDsnpR’s flexible frame-

work allows the application of different gene-scoring

methods; the application of such methods is necessary

for detecting gene-based associations under different

genetic architectures for the traits.31 The user-definable r2

parameter enables the scanning of a greater range of allele

frequencies at the linked locus.27 Bin definitions and pre-

calculated pairwise LD information can be updated on

the basis of the user’s interests and the information avail-

able. LD-based binning might also serve as a complemen-

tary and/or alternative approach to imputation. In partic-

ular, as high-quality LD data from the 1,000 Genomes

Project21 emerges, all GWASs, including those previously

subjected to imputation, might benefit from simple and

efficient LD-based binning at no extra cost. As we show

here, LD-based binning can further enhance imputed

GWASs, albeit to a lesser extent than unimputed datasets.

More tools that allow for incorporation of LD into the

interpretation of GWAS data are emerging,36,37 further

testifying to the importance of this approach. Also, for

studies genotyped on different platforms and/or imputed

with the use of different reference panels, LD-based

binning enables uniform comparison at both the gene

and pathway levels.

It is crucial to note that our study, as well as LDsnpR,

only addresses SNP-to-gene assignment. Issues involving

the derivation of the most accurate gene score (which

accounts for gene size and LD between SNPs), the handling

of SNPs that are assigned to multiple, possibly overlapping,

genes, and the correlation between genes are unresolved

obstacles for pathway-analysis approaches13 and are

beyond the scope of this paper. Furthermore, the benefits

of LD-based binning will be unique to each GWAS depend-

ing on the trait and its true underlying genetic architec-

ture, the study design, and the extent of SNP coverage.

Supplemental Data

Supplemental Data include one figure and two tables and can be

found with this article online at http://www.cell.com/AJHG.

Table 4. Pairwise Concordance between GWASs at SNP and Gene Levels

| | WTCCC vs. TOP | WTCCC vs. German | TOP vs. German | TOP Imputed vs. German Imputed |
|---|---|---|---|---|
| SNP level | 0.0066 (0.00018) | 0.0037 (0.31) | −0.0018 (0.51) | −0.00023 (0.83) |
| Gene level (positional binning) | 0.030 (1.78 × 10⁻⁷) | −0.0017 (0.78) | 0.023 (4.78 × 10⁻⁵) | 0.068 (<2.2 × 10⁻¹⁶) |
| Gene level (LD-based binning) | 0.077 (<2.2 × 10⁻¹⁶) | 0.027 (7.24 × 10⁻⁷) | 0.053 (<2.2 × 10⁻¹⁶) | 0.098 (<2.2 × 10⁻¹⁶) |

The Spearman rank correlation and p value (in parentheses) are shown for each pairwise comparison.

Acknowledgments

We acknowledge Isabel Hanson Scientific Writing for critical help

with the manuscript preparation. This work was supported by

grants from the Bergen Research Foundation, the University of

Bergen, the Research Council of Norway (FUGE, Psykisk Helse,

and eVita), UNI Computing, Western Norway Regional Health

Authority (Helse Vest), the Dr. Einar Martens Fund, South-Eastern

Norway Regional Health Authority (Helse Sør-Øst), the National

Institutes of Health and the National Heart, Lung, and Blood Insti-

tute (U01 HL089856, RO1 MH087590 and R01 MH081862), and

the German Federal Ministry of Education and Research (National

Genome Research Network 2, the National Genome Research

Network plus, and the Integrated Genome Research Network

MooDS [grant 01GS08144 to S.C.]). LDsnpR was developed within

the eSysbio project. We acknowledge Hakon Sagehaug for contrib-

uting Java code and members of the BioStar QA community for

their help and interesting discussions.

Received: September 6, 2011

Revised: February 16, 2012

Accepted: February 27, 2012

Published online: March 22, 2012

Web Resources

The URLs for data presented herein are as follows:

1,000 Genomes Project, http://www.1000genomes.org/

Catalog of Published Genome-Wide Association Studies, www.genome.gov/gwastudies/

ENIGMA protocol, http://enigma.loni.ucla.edu/protocols/genetics-protocols/

HapMap Project, http://hapmap.ncbi.nlm.nih.gov/

Human Ensembl Release 54, http://may2009.archive.ensembl.org/biomart/martview/11839bb5ec82fb10bf0333540fa09c46

IMPUTE2 Software, http://mathgen.stats.ox.ac.uk/impute/impute_v2.html

Ingenuity Pathway Analysis, http://www.ingenuity.com/

LDsnpR, http://services.cbu.uib.no/software/ldsnpr

PLINK, http://pngu.mgh.harvard.edu/~purcell/plink/

R Archive Network, http://cran.r-project.org

References

1. International HapMap Consortium. (2005). A haplotype map

of the human genome. Nature 437, 1299–1320.

2. Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M.,

Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009). Potential

etiologic and functional implications of genome-wide associa-

tion loci for human diseases and traits. Proc. Natl. Acad. Sci.

USA 106, 9362–9367.

3. Manolio, T.A. (2010). Genomewide association studies and

assessment of the risk of disease. N. Engl. J. Med. 363, 166–176.

4. Bondy, B. (2011). Genetics in psychiatry: Are the promises

met? World J. Biol. Psychiatry 12, 81–88.

5. Gershon, E.S., Alliey-Rodriguez, N., and Liu, C. (2011). After

GWAS: Searching for genetic risk for schizophrenia and

bipolar disorder. Am. J. Psychiatry 168, 253–256.

6. Stranger, B.E., Stahl, E.A., and Raj, T. (2011). Progress and

promise of genome-wide association studies for human

complex trait genetics. Genetics 187, 367–383.

7. Gibson, G. (2010). Hints of hidden heritability in GWAS. Nat.

Genet. 42, 558–560.

8. Davies, G., Tenesa, A., Payton, A., Yang, J., Harris, S.E., Lie-

wald, D., Ke, X., Le Hellard, S., Christoforou, A., Luciano,

M., et al. (2011). Genome-wide association studies establish

that human intelligence is highly heritable and polygenic.

Mol. Psychiatry 16, 996–1005.

9. Lee, S.H., Wray, N.R., Goddard, M.E., and Visscher, P.M.

(2011). Estimating missing heritability for disease from

genome-wide association studies. Am. J. Hum. Genet. 88,

294–305.

10. Cantor, R.M., Lange, K., and Sinsheimer, J.S. (2010). Priori-

tizing GWAS results: A review of statistical methods and

recommendations for their application. Am. J. Hum. Genet.

86, 6–22.

11. McCarthy, M.I., Abecasis, G.R., Cardon, L.R., Goldstein, D.B.,

Little, J., Ioannidis, J.P., and Hirschhorn, J.N. (2008). Genome-

wide association studies for complex traits: Consensus, uncer-

tainty and challenges. Nat. Rev. Genet. 9, 356–369.

12. Bergen, S.E., Balhara, Y.P., Christoforou, A., Cole, J., Degen-

hardt, F., Dempster, E., Fatjo-Vilas, M., Khedr, Y., Lopez,

L.M., Lysenko, L., et al. (2011). Summaries from the XVIII

World Congress of Psychiatric Genetics, Athens, Greece, 3-7

October 2010. Psychiatr. Genet. 21, 136–172.

13. Wang, K., Li, M., and Hakonarson, H. (2010). Analysing bio-

logical pathways in genome-wide association studies. Nat.

Rev. Genet. 11, 843–854.

14. Wang, K., Li, M., and Bucan, M. (2007). Pathway-based

approaches for analysis of genomewide association studies.

Am. J. Hum. Genet. 81, 1278–1283.

15. Neale, B.M., and Sham, P.C. (2004). The future of association

studies: Gene-based analysis and replication. Am. J. Hum.

Genet. 75, 353–362.

16. Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E.,

Ballinger, D.G., Frazer, K.A., and Cox, D.R. (2005). Whole-

genome patterns of common DNA variation in three human

populations. Science 307, 1072–1079.

17. Lawrence, R., Evans, D.M., Morris, A.P., Ke, X., Hunt, S., Pao-

lucci, M., Ragoussis, J., Deloukas, P., Bentley, D., and Cardon,

L.R. (2005). Genetically indistinguishable SNPs and their

influence on inferring the location of disease-associated vari-

ants. Genome Res. 15, 1503–1510.

18. Wellcome Trust Case Control Consortium. (2007). Genome-

wide association study of 14,000 cases of seven common

diseases and 3,000 shared controls. Nature 447, 661–678.

19. Djurovic, S., Gustafsson, O., Mattingsdal, M., Athanasiu, L.,

Bjella, T., Tesli, M., Agartz, I., Lorentzen, S., Melle, I., Morken,

G., and Andreassen, O.A. (2010). A genome-wide association

study of bipolar disorder in Norwegian individuals, followed

by replication in Icelandic sample. J. Affect. Disord. 126,

312–316.

20. Cichon, S., Muhleisen, T.W., Degenhardt, F.A., Mattheisen,

M., Miro, X., Strohmaier, J., Steffens, M., Meesters, C., Herms,

S., Weingarten, M., et al; Bipolar Disorder Genome Study

(BiGS) Consortium. (2011). Genome-wide association study

identifies genetic variation in neurocan as a susceptibility

factor for bipolar disorder. Am. J. Hum. Genet. 88, 372–381.

21. 1000 Genomes Consortium. (2010). A map of human genome

variation from population-scale sequencing. Nature 467,

1061–1073.

22. So, H.C., Gui, A.H., Cherny, S.S., and Sham, P.C. (2011). Eval-

uating the heritability explained by known susceptibility vari-

ants: A survey of ten complex diseases. Genet. Epidemiol. 35,

310–317.

23. Neale, B.M., and Purcell, S. (2008). The positives, protocols,

and perils of genome-wide association. Am. J. Med. Genet.

B. Neuropsychiatr. Genet. 147B, 1288–1294.

24. Blow, M.J., McCulley, D.J., Li, Z., Zhang, T., Akiyama, J.A.,

Holt, A., Plajzer-Frick, I., Shoukry, M., Wright, C., Chen, F.,

et al. (2010). ChIP-Seq identification of weakly conserved

heart enhancers. Nat. Genet. 42, 806–810.

25. Vandiedonck, C., Taylor, M.S., Lockstone, H.E., Plant, K., Tay-

lor, J.M., Durrant, C., Broxholme, J., Fairfax, B.P., and Knight,

J.C. (2011). Pervasive haplotypic variation in the spliceo-tran-

scriptome of the human major histocompatibility complex.

Genome Res. 21, 1042–1054.

26. Spencer, C.C., Su, Z., Donnelly, P., and Marchini, J. (2009).

Designing genome-wide association studies: Sample size,

power, imputation, and the choice of genotyping chip. PLoS

Genet. 5, e1000477.

27. Wray, N.R. (2005). Allele frequencies and the r2 measure of

linkage disequilibrium: Impact on design and interpretation

of association studies. Twin Res. Hum. Genet. 8, 87–94.

28. Santos, P.S., Hohne, J., Poerner, F., da Graca Bicalho, M.,

Uchanska-Ziegler, B., and Ziegler, A. (2011). Does the new

HapMap throw the baby out with the bath water? Eur. J.

Hum. Genet. 19, 733–734.

29. Ioannidis, J.P., Thomas, G., and Daly, M.J. (2009). Validating,

augmenting and refining genome-wide association signals.

Nat. Rev. Genet. 10, 318–329.

30. Saccone, S.F., Hinrichs, A.L., Saccone, N.L., Chase, G.A., Kon-

vicka, K., Madden, P.A., Breslau, N., Johnson, E.O., Hatsukami,

D., Pomerleau, O., et al. (2007). Cholinergic nicotinic receptor

genes implicated in a nicotine dependence association study

targeting 348 candidate genes with 3713 SNPs. Hum. Mol.

Genet. 16, 36–49.

31. Lehne, B., Lewis, C.M., and Schlitt, T. (2011). From SNPs to

genes: disease association at the gene level. PLoS ONE 6,

e20133.

32. Segre, A.V., Groop, L., Mootha, V.K., Daly, M.J., and Altshuler,

D.; DIAGRAM Consortium; MAGIC investigators. (2010).

Common inherited variation in mitochondrial genes is not

enriched for associations with type 2 diabetes or related glyce-

mic traits. PLoS Genet. 6, e1001058.

33. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira,

M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly,

M.J., and Sham, P.C. (2007). PLINK: A tool set for whole-

genome association and population-based linkage analyses.

Am. J. Hum. Genet. 81, 559–575.

34. Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S.,

Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub,

T.R., Lander, E.S., and Mesirov, J.P. (2005). Gene set enrich-

ment analysis: A knowledge-based approach for interpreting

genome-wide expression profiles. Proc. Natl. Acad. Sci. USA

102, 15545–15550.

35. Holmans, P., Green, E.K., Pahwa, J.S., Ferreira, M.A., Purcell,

S.M., Sklar, P., Owen, M.J., O’Donovan, M.C., and Craddock,

N.; Wellcome Trust Case-Control Consortium. (2009). Gene

ontology analysis of GWA study data sets provides insights

into the biology of bipolar disorder. Am. J. Hum. Genet. 85,

13–24.

36. Hong, M.G., Pawitan, Y., Magnusson, P.K., and Prince, J.A.

(2009). Strategies and issues in the detection of pathway

enrichment in genome-wide association studies. Hum. Genet.

126, 289–301.

37. Zhang, K., Chang, S., Cui, S., Guo, L., Zhang, L., and Wang, J.

(2011). ICSNPathway: Identify candidate causal SNPs and

pathways from genome-wide association study by one analyt-

ical framework. Nucleic Acids Res. 39 (Web Server issue),

W437–443.

38. Li, Y., Willer, C.J., Ding, J., Scheet, P., and Abecasis, G.R.

(2010). MaCH: Using sequence and genotype data to estimate

haplotypes and unobserved genotypes. Genet. Epidemiol. 34,

816–834.

39. Howie, B.N., Donnelly, P., and Marchini, J.A. (2009). A flexible

and accurate genotype imputation method for the next generation

of genome-wide association studies. PLoS Genet. 5,

e1000529.

REPORT

Rare Mutations in XRCC2 Increase the Risk of Breast Cancer

D.J. Park,1,20 F. Lesueur,2,20 T. Nguyen-Dumont,1 M. Pertesi,2 F. Odefrey,1 F. Hammet,1 S.L. Neuhausen,3

E.M. John,4,5 I.L. Andrulis,6 M.B. Terry,7 M. Daly,8 S. Buys,9 F. Le Calvez-Kelm,2 A. Lonie,10 B.J. Pope,10

H. Tsimiklis,1 C. Voegele,2 F.M. Hilbers,11 N. Hoogerbrugge,12 A. Barroso,13 A. Osorio,13,14 the Breast Cancer Family Registry, the Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer, G.G. Giles,15 P. Devilee,11,16 J. Benitez,13,14 J.L. Hopper,17 S.V. Tavtigian,18 D.E. Goldgar,19

and M.C. Southey1,*

An exome-sequencing study of families with multiple breast-cancer-affected individuals identified two families with XRCC2 mutations,

one with a protein-truncating mutation and one with a probably deleterious missense mutation. We performed a population-based case-control

mutation-screening study that identified six probably pathogenic coding variants in 1,308 cases with early-onset breast cancer

and no variants in 1,120 controls (the severity grading was p < 0.02). We also performed additional mutation screening in 689 multiple-case

families. We identified ten breast-cancer-affected families with protein-truncating or probably deleterious rare missense variants in

XRCC2. Our identification of XRCC2 as a breast cancer susceptibility gene thus increases the proportion of breast cancers that are asso-

ciated with homologous recombination-DNA-repair dysfunction and Fanconi anemia and could therefore benefit from specific targeted

treatments such as PARP (poly ADP ribose polymerase) inhibitors. This study demonstrates the power of massively parallel sequencing

for discovering susceptibility genes for common, complex diseases.

Currently, only approximately 30% of the familial risk for

breast cancer has been explained, leaving the substantial

majority unaccounted for.1 Recently, exome sequencing

has been demonstrated to be a powerful tool for identi-

fying the underlying cause of rare Mendelian disorders.

However, diseases such as breast cancer present substan-

tially increased complexity in terms of locus, allelic and

phenotypic heterogeneity, and relationships between

genotype and phenotype.

As part of a collaborative (Leiden University Medical

Centre, the Spanish National Cancer Center, and The

University of Melbourne) project involving the exome

capture and massively parallel sequencing of multiple-

case breast-cancer-affected families, we applied whole-

exome sequencing to DNA from multiple affected relatives

from 13 families (family structure and sample availability

were considered before the affected relatives were chosen).

Bioinformatic analysis of the resulting exome sequences

identified a protein-truncating mutation, c.651_652del

(p.Cys217*), in X-ray repair cross complementing gene-2

(XRCC2 [MIM 600375; NM_005431.1]) in the peripheral-

blood DNA of a man participating in the Australian Breast

Cancer Family Registry2 (ABCFR; Figure 1A); this man (III-4

in Figure 1A) had been diagnosed with breast cancer at

29 years of age, and his mother (II-3), sister (III-5), and

cousin (III-1) had been diagnosed with breast cancer at

37, 41, and 34 years of age, respectively. The cousin

(III-1), who had also been selected for exome sequencing,

did not carry this mutation; the sister's DNA was Sanger

sequenced and was found to carry the mutation; and there

was no DNA available for testing of the mother. Exome

sequencing of three individuals from a family participating

in a Dutch research study of multiple-case breast-cancer-

affected families identified a probably deleterious missense

mutation (c.271C>T [p.Arg91Trp] in XRCC2) (Figure 2) in

two sisters (II-6 and II-8 in Figure 1B) diagnosed with breast

cancer at 40 and 48 years of age, respectively, but not in

their cousin (II-1), who was diagnosed at 47 years of age.

Genotyping of XRCC2 mutations c.651_652del

(p.Cys217*) and c.271C>T (p.Arg91Trp) in 1,344 cases

1Genetic Epidemiology Laboratory, The University of Melbourne, Victoria 3010, Australia; 2Genetic Cancer Susceptibility Group, International Agency for

Research on Cancer, 69372 Lyon, France; 3Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA; 4Cancer Prevention Institute of California, Fremont, CA 94538, USA; 5Department of Health Research and Policy, Stanford Cancer Center Institute, Stan-

ford, CA 94305, USA; 6Department of Molecular Genetics, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, ON M5G 1X5, Canada; 7Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY 10032, USA; 8Fox Chase Cancer Center, Philadelphia,

PA 19111, USA; 9Huntsman Cancer Institute, University of Utah Health Sciences Center, Salt Lake City, UT 84112, USA; 10Victorian Life Sciences Compu-

tation Initiative, Carlton, Victoria 3010, Australia; 11Department of Human Genetics, Leiden University Medical Center, Leiden, 2300 RC Leiden, The

Netherlands; 12Department of Human Genetics, Radboud University Nijmegen Medical Center, 6525 GA Nijmegen, The Netherlands; 13Human Genetics

Group, Human Cancer Genetics Program, Spanish National Cancer Center, 28029 Madrid, Spain; 14Spanish Network on Rare Diseases, 46010 Valencia,

Spain; 15Centre for Cancer Epidemiology, The Cancer Council Victoria, Carlton, Victoria 3052, Australia; 16Department of Pathology, Leiden University

Medical Center, Leiden, 2300 RC Leiden, The Netherlands; 17Centre for Molecular, Environmental, Genetic, and Analytical Epidemiology, School of Pop-

ulation Health, The University of Melbourne, Victoria 3010, Australia; 18Department of Oncological Sciences, Huntsman Cancer Institute, University of

Utah School of Medicine, Salt Lake City, UT 84112, USA; 19Department of Dermatology, University of Utah School of Medicine, Salt Lake City, UT

84132, USA

20These authors contributed equally to this work

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.02.027. ©2012 by The American Society of Human Genetics. All rights reserved.

734 The American Journal of Human Genetics 90, 734–739, April 6, 2012

and 1,436 controls from the Melbourne Collaborative

Cohort Study3 (MCCS) and the ABCFR revealed one

control (II-2, Figure 1C) who carried c.651_652del

(p.Cys217*). Intriguingly, this control individual’s sister

(II-1) was diagnosed with breast cancer at 63 years of age,

and her mother (I-2) was diagnosed with melanoma at

69 years of age (Figure 1C, Tables 1 and 2).

XRCC2, a RAD51 paralog, was cloned because of its

ability to complement the DNA-damage sensitivity of the

irs1 hamster cell line.4 Cells derived from Xrcc2-knockout

mice exhibit profound genetic instability as a result of

homologous recombination (HR) deficiency.5 XRCC2 is

highly conserved, and most truncations of the protein

destroy its ability to protect cells from the effects of the

DNA cross-linking agent mitomycin C.6 The involvement

of the HR DNA repair genes BRCA1 (MIM 113705),

BRCA2 (MIM 600185), ATM (MIM 607585), CHEK2 (MIM

604373), BRIP1 (MIM 605882), PALB2 (MIM 610355),

and RAD51C (MIM 602774) in breast cancer risk empha-

sizes the importance of this mechanism in the etiology

of breast cancer.7–9 Biallelic mutations in three of these

genes are associated with Fanconi anemia (FA), and, most

interestingly, Shamseldin et al.10 have recently reported

a homozygous frameshift mutation in XRCC2 as being

associated with a previously unrecognized form of FA.

XRCC2 binds directly to the C-terminal portion of the

product of the breast cancer susceptibility pathway gene

RAD51 (MIM 179617), which is central to HR.6,11 XRCC2

also complexes in vivo with RAD51B (RAD51L1 [MIM

602948]), the product of the breast and ovarian cancer

susceptibility gene RAD51C9 and the product of the

ovarian cancer risk gene RAD51D (MIM 602954),12,13 and

localizes to sites of DNA damage.6 Cells deficient in

XRCC2 also show centrosome disruption, a key compo-

nent of mitotic-apparatus dysfunction, which is often

linked to the onset of mitotic catastrophe. XRCC2 is

important in preventing chromosome missegregation

leading to aneuploidy.14 Studies of common genetic varia-

tion in XRCC2 have reported some evidence of association

with breast cancer risk (e.g., rs3218408),15 subtle effects on

DNA-repair capacity,16 and poor survival associated with

rs3218536 (XRCC2, Arg188His).15

On the basis of the exome-sequencing results, the subse-

quent genotyping of the two probably pathogenic variants

Figure 1. Pedigrees of Families Found to Carry XRCC2 Mutations

Mutation status is indicated for all family members for whom a DNA sample was available. Cancer diagnosis and age of onset are indicated for affected members. Asterisks indicate that DNA underwent exome sequencing (libraries for 50 bp fragment reads were prepared according to the SOLiD Baylor protocol 2.1 and the Nimblegen exome-capture protocol v.1.2 with some variations). The following abbreviations are used: BC, breast cancer (black filled symbols); PC, pancreatic cancer; BwC, bowel cancer; UC, uterine cancer; MM, malignant melanoma; UK, unknown age; BlC, bladder cancer; OC, ovarian cancer; BCC, basal cell carcinoma; L, lung cancer; (all gray-filled symbols); V, verified cancer (via cancer registry or pathology report); and wt, wild-type. Some symbols represent more than one person as indicated by a numeral.


in the MCCS and ABCFR, the rarity of these variants, and

the biochemical plausibility of XRCC2, we conducted two

further studies in parallel. The first study was case-control

mutation screening of XRCC2 (with high-resolution melt

[HRM] curve analysis followed by Sanger-sequencing

confirmation) in an additional series of 1,308 cases with

early-onset breast cancer and 1,120 frequency-matched

controls recruited through population-based sampling

by the Breast Cancer Family Registry2 (BCFR; Supplemental

Data, available online); the BCFR sampling was recently

carried out for the characterization of the breast cancer

risk associated with variants in ATM and CHEK2.17,18 The

second study was mutation screening of XRCC2 in a series

of index cases from multiple-case breast-cancer-affected

families and a series of male breast cancer cases.

The case-control mutation screening identified two cases

that carried protein-truncating variants in XRCC2: indi-

vidual III-2 had c.49C>T (p.Arg17*) (Figure 1F), and indi-

vidual II-1 had c.651_652del (p.Cys217*) (Figure 1G).

Five cases carried singleton missense substitutions ranging

from probably deleterious to relatively innocuous (accord-

ing to in silico prediction). One control carried a relatively

innocuous missense substitution (Table 2). In addition,

a case diagnosed with breast cancer at 32 years of age

carried a G>A substitution located one nucleotide prior

to the start codon.
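As an illustrative aside that is not part of the original analysis, the coding-DNA and protein descriptions quoted above can be cross-checked with simple arithmetic: coding position n falls in codon (n − 1) div 3 + 1. The sketch below assumes that the first position named in each c. description is the one that determines the affected residue, which holds for the variants listed here but is not guaranteed for every deletion or frameshift. The function names are hypothetical.

```python
import re

# (cDNA change, protein change) pairs as quoted in the text.
VARIANTS = [
    ("c.49C>T", "p.Arg17*"),
    ("c.271C>T", "p.Arg91Trp"),
    ("c.283A>C", "p.Ile95Leu"),
    ("c.651_652del", "p.Cys217*"),
]


def first_cdna_position(cdna: str) -> int:
    """Return the first coding-sequence position named in a c. description."""
    match = re.match(r"c\.(\d+)", cdna)
    if match is None:
        raise ValueError(f"unsupported cDNA description: {cdna}")
    return int(match.group(1))


def codon_number(cdna_position: int) -> int:
    """Map a 1-based coding position to its 1-based codon number."""
    return (cdna_position - 1) // 3 + 1


def protein_position(protein: str) -> int:
    """Return the residue number from a three-letter p. description."""
    match = re.match(r"p\.[A-Za-z]{3}(\d+)", protein)
    if match is None:
        raise ValueError(f"unsupported protein description: {protein}")
    return int(match.group(1))


def is_consistent(cdna: str, protein: str) -> bool:
    """Check that the codon implied by the cDNA position matches the residue."""
    return codon_number(first_cdna_position(cdna)) == protein_position(protein)


if __name__ == "__main__":
    for cdna, protein in VARIANTS:
        print(cdna, protein, is_consistent(cdna, protein))
```

A check of this kind only confirms that position and residue number agree; it says nothing about whether the stated amino acid change is the correct translation of the nucleotide change.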

We graded the rare missense variants by using three

computational tools: SIFT, Polyphen2.1, and Align-

GVGD. Differences in grading between these tools were

minor. Depending on which of the three computational

tools we used to grade the missense substitutions, the

statistical significances of the differences in the frequency

and severity distributions of protein-truncating variants

and rare missense substitutions between cases and controls

from the case-control mutation-screening study fell in the

range of p = 0.01–0.02 (adjusted for race, study center, and

age). There were six probably deleterious variants (pre-

dicted deleterious by at least two prediction algorithms)

in the cases and none in the controls, corresponding to

a p value by Fisher’s exact test of 0.02. All together, the

case-control mutation-screening data provide statistical

support for the hypothesis that rare, evolutionarily

unlikely sequence variation in XRCC2 is associated with

increased risk of breast cancer.
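The Fisher's exact comparison quoted above (six carriers among 1,308 cases versus none among 1,120 controls) can be reproduced approximately with a short hypergeometric calculation. This is a minimal sketch using only the counts given in the text. Whether the reported value was one- or two-sided is not stated, so both are computed, and the function names are hypothetical.

```python
from math import comb

CASES, CONTROLS = 1308, 1120
CARRIER_CASES, CARRIER_CONTROLS = 6, 0


def hypergeom_pmf(k: int, carriers: int, cases: int, total: int) -> float:
    """P(exactly k of the carriers are cases) with all table margins fixed."""
    return comb(cases, k) * comb(total - cases, carriers - k) / comb(total, carriers)


def fisher_exact(carrier_cases: int, carrier_controls: int,
                 cases: int, controls: int) -> tuple[float, float]:
    """Return (one-sided, two-sided) Fisher's exact p values for a 2x2 table."""
    total = cases + controls
    carriers = carrier_cases + carrier_controls
    pmf = [hypergeom_pmf(k, carriers, cases, total) for k in range(carriers + 1)]
    observed = pmf[carrier_cases]
    # One-sided: tables with at least as many carriers among the cases.
    one_sided = sum(pmf[carrier_cases:])
    # Two-sided: all tables that are no more probable than the observed one.
    two_sided = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return one_sided, two_sided


ONE_SIDED, TWO_SIDED = fisher_exact(CARRIER_CASES, CARRIER_CONTROLS, CASES, CONTROLS)

if __name__ == "__main__":
    print(f"one-sided p = {ONE_SIDED:.4f}, two-sided p = {TWO_SIDED:.4f}")
```

Under these assumptions the one-sided value comes out near 0.02, in line with the figure reported above; the two-sided value is somewhat larger but remains below 0.05.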

Mutation screening (by Sanger sequencing) of XRCC2 in

the index cases of 689 multiple-case breast-cancer-affected

families participating in the BCFR and the Kathleen

Cuningham Foundation Consortium for Research into

Familial Breast Cancer19 (kConFab) plus 150 male breast

cancer cases participating in a US-based study of male

breast cancer (Beckman Research Institute of the City of

Hope20) and kConFab revealed three rare coding-sequence

alterations. We identified a second family (from the kCon-

Fab resource) with an index case who carried XRCC2

c.651_652del (p.Cys217*); this individual (II-5, Figure 1D)

also carried a truncating mutation in BRCA1 (c.70_80del

[p.Cys24Serfs*13]). We identified an ABCFR index case

(II-2, Figure 1E and Figure 2) who carried the previously

identified missense substitution, XRCC2 c.271C>T

(p.Arg91Trp). We also identified a male breast cancer case

who carried a relatively innocuous missense substitution,

c.283A>C (p.Ile95Leu).

In addition to the protein-truncating mutations and the

above-described missense variants, a number of missense,

silent, and intronic variants were also observed in

XRCC2, and common SNPs that were reported in public

databases such as dbSNP, HapMap, or the 1000 Genomes

Project were also identified. These included the common

coding SNP c.563G>A (p.Arg188His) (rs3218536), one

silent substitution, three 5′ UTR variants, five 3′ UTR variants, and six intronic variants in the vicinity of exon-intron boundaries. All these variants were predicted to be

neutral according to various in silico prediction tools

(Supplemental Data, Tables 1 and 2). For common SNPs

(>1% in controls), no difference in allele frequency was

observed between cases and controls in the BCFR series.

The genetic studies included in this report received ap-

proval from The University of Melbourne Human Research

Ethics Committee, the International Agency for Research

on Cancer institutional review board (IRB), and the local

IRBs of every center from which we report findings.

Of the six distinct rare variants predicted to severely

affect protein function and identified in our work, two were

truncating mutations, and four were missense changes.

Although most recognized pathogenic mutations in the

major breast cancer susceptibility genes are protein trun-

cating, there is evidence that missense mutations might

be the more prominent of some more recently-identified

Figure 2. XRCC2 Multiple-Sequence Alignment Centered on Position Arg91

Missense substitutions observed in this interval are given with the missense residue directly above the corresponding human reference sequence residue. The following abbreviations are used: Hsap, Homo sapiens; Mmul, Macaca mulatta; Mmus, Mus musculus; Cfam, Canis familiaris; Lafr, Loxodonta africana; Mdom, Monodelphis domestica; Oana, Ornithorhynchus anatinus; Ggal, Gallus gallus; Acar, Anolis carolinensis; Xtro, Xenopus tropicalis; Drer, Danio rerio; Bflo, Branchiostoma floridae; Spur, Strongylocentrotus purpuratus; Nvec, Nematostella vectensis; and Tadh, Trichoplax adhaerens. The alignment, or updated versions thereof, is available at the Align-GVGD website (see Web Resources).

736 The American Journal of Human Genetics 90, 734–739, April 6, 2012

breast cancer susceptibility genes. For example, in comprehensive studies of ATM and CHEK2, the proportion of probably deleterious or pathogenic rare sequence variants that

are missense changes is often over 50%. More relevantly,

estimates of breast cancer risk are higher for missense vari-

ants than they are for protein-truncating variants. This

has been observed through case-control mutation-

screening analyses of ATM and CHEK217,18 and through

a pedigree analysis21 of ATM; in these analyses, the breast

cancer risk associated with one specific missense mutation

approaches the average risk associated with pathogenic

BRCA2 mutations. A very recent analysis of PALB2 muta-

tions found no difference in the frequency of missense

mutations between two case groups (contralateral and

unilateral breast cancer cases),22 suggesting that the contri-

bution of missense mutations to breast cancer risk might

vary between susceptibility genes.

Our finding of XRCC2 as a breast cancer susceptibility

gene expands the proportion of breast cancer that is associ-

ated with rare mutations in the HR-DNA-repair pathways

and the number of breast cancer susceptibility genes in

which biallelic mutations are associated with FA; the precise

contribution of mutation in these genes will become clearer

as more whole-exome-sequencing (or whole-genome-

sequencing) and targeted-pathway-sequencing studies are

performed. XRCC2 mutations appear to be very rare, even

in the context of multiple-case families; they appear in 1

of 66 (1.5%) early-onset female breast cancer cases with

a strong family history of the disease present in the ABCFR,

compared to 9 (14%) BRCA1 mutations, 6 (9%) BRCA2

mutations, 3 (5%) TP53 (MIM 191170) mutations, and 2

(3%) PALB2 mutations.
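The percentages quoted for the 66 early-onset ABCFR cases with a strong family history follow directly from the carrier counts. The short sketch below simply recomputes them, using only the counts given in the text; the variable names are hypothetical.

```python
COHORT_SIZE = 66

# Number of mutation carriers per gene among the 66 cases, as quoted in the text.
CARRIERS = {"XRCC2": 1, "BRCA1": 9, "BRCA2": 6, "TP53": 3, "PALB2": 2}


def percent(count: int, total: int) -> float:
    """Carrier frequency as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


FREQUENCIES = {gene: percent(n, COHORT_SIZE) for gene, n in CARRIERS.items()}

if __name__ == "__main__":
    for gene, value in FREQUENCIES.items():
        print(f"{gene}: {CARRIERS[gene]}/{COHORT_SIZE} = {value}%")
```

With so few carriers per gene, these point estimates carry wide uncertainty and are best read as orders of magnitude.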

These frequencies are consistent with data from both

breast cancer linkage studies that have suggested that no

single gene is likely to account for a large fraction of the re-

maining familial aggregation of breast cancer5 and reports

from recent candidate-gene sequencing studies that have

associated other members of the HR pathway with breast

cancer susceptibility.23,24 Although mutations in HR-

DNA-repair genes are rare, it is important to identify people

whose breast cancer is associated with HR-DNA-repair

dysfunction because they could benefit from specific tar-

geted treatments such as PARP inhibitors. Unaffected rela-

tives of people with a mutation in a HR-DNA-repair gene

could also be offered predictive testing and subsequent

clinical management and genetic counseling on the basis

of their mutation status. The identification of a family

with rare mutations in both XRCC2 and BRCA1 illustrates

the complexity of the underlying genetic architecture of

breast cancer susceptibility for some families and the chal-

lenges for personalized risk-prediction models that are

incorporating an increasing array of risk factors, which

include rare mutations in breast cancer susceptibility genes

and more common genetic variation. Currently, esti-

mating the relative importance of the XRCC2 mutation

to the breast cancer risk for members of this family is diffi-

cult because of the presence of a BRCA1 protein-truncating

mutation in the proband in addition to the XRCC2 muta-

tion. Many examples have been described of individuals

and families carrying deleterious mutations in more than

Table 1. Mutation Screening in Multiple-Case Breast Cancer Families

Rare XRCC2 Variant | Effect on Protein | Align-GVGD [a] | SIFT [b] | PolyPhen-2.1 (HumDiv) | Case or Control | Pedigree (Study Source) | Age and Origin of Carrier

Truncating variants

c.651_652del | p.Cys217* | – | – | – | case | Figure 1A (ABCFR) [e] | 29, white

c.651_652del | p.Cys217* | – | – | – | case [c] | Figure 1C (kConFab) | 36, white

c.651_652del | p.Cys217* | – | – | – | control | Figure 1D (MCCS) | 72, white

Missense substitutions

c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case | Figure 1B (Dutch) [e] | 40, white

c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case [d] | Figure 1E (ABCFR) | 32, white

c.283A>C | p.Ile95Val | C0 | 0.34 | benign | case | – (kConFab) | 59, white

c.283A>G | p.Ile95Leu | C0 | 0.41 | benign | case | – (kConFab) | 70, white

c.283A>C | p.Ile95Val | C0 | 0.34 | benign | case | – (BRICOH) | 68, white

Silent substitution

c.582G>T | p.Thr194Thr | – | – | – | case | – (kConFab) | 60, white

The following abbreviations are used: ABCFR, Australian Breast Cancer Family Registry; kConFab, Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer; MCCS, Melbourne Collaborative Cohort Study; and BRICOH, Beckman Research Institute of City of Hope.

[a] Protein multiple sequence alignment (PMSA) used for obtaining scores for Align-GVGD: from Human to Branchiostoma floridae (Bflo).

[b] PMSA used for obtaining scores for SIFT: from Human to Trichoplax (Tadh).

[c] This woman also carries BRCA1 c.70_80del (p.Cys24Serfs*13).

[d] This carrier of p.Arg91Trp was identified through both the ABCFR multiple-case family screening and the BCFR-IARC (Breast Cancer Family Registry-International Agency for Research on Cancer) case-control screening.

[e] Family included in the exome-sequencing phase.


one proven breast cancer susceptibility gene; one such

example is the co-observation of BRCA1, BRCA2, ATM,

and CHEK2 mutations.21,25

This study demonstrates the power of massively parallel

sequencing in the discovery of additional breast cancer

susceptibility genes when used with an appropriate study

design. Our approach could be applied to other common,

complex diseases with components of unexplained herita-

bility.

Supplemental Data

Supplemental Data include 6 tables and can be found with this

article online at http://www.cell.com/AJHG.

Acknowledgments

This work was supported by Cancer Council Victoria (grant

628774), the National Institutes of Health (R01CA155767 and

R01CA121245), the Australian National Health and Medical

Research Council (grant 466668), The University of Melbourne

(infrastructure award to J.L.H.), a Victorian Life Sciences Computa-

tion Initiative grant (VR00353) on its Peak Computing Facility at

the University of Melbourne, and an initiative of the Victorian

Government and Dutch Cancer Society (grant UL 2009-4388).

The research resources, including the Melbourne Collaborative

Cohort Study, the Australian Breast Cancer Family Study, the Breast

Cancer Family Registry, and the Kathleen Cuningham Foundation

Consortium for Research into Familial Breast Cancer, are further

acknowledged in the supplementary information. We wish to

thank Nivonirina Robinot and Geoffroy Durand for their technical

help during the case-control mutation screening at the Interna-

tional Agency for Research on Cancer, Georgia Chenevix-Trench

for her support of and contribution to the establishment of the

case-control mutation-screening study, and Greg Wilhoite for

sequencing the male breast cancer cases at the Beckman Research

Institute of City of Hope. This work and partial support for S.L.N.

was provided by the Morris and Horowitz Families Endowment.

Work at the Spanish National Cancer Center was partially funded

by the Spanish Association Against Cancer and Health Ministry

(FIS08/1120). M.C.S. is a National Health and Medical Research

Council (NHMRC) Senior Research Fellow and a Victorian Breast

Cancer Research Consortium (VBCRC) Group Leader. J.L.H. is

a NHMRC Australia Fellow and a VBCRC Group Leader. T.N.-D. is

a Susan G. Komen for the Cure Postdoctoral Fellow.

Received: November 20, 2011

Revised: January 16, 2012

Accepted: February 29, 2012

Published online: March 29, 2012

Web Resources

The URLs for data presented herein are as follows:

Align-GVGD, http://agvgd.iarc.fr/alignments

GATK v.1.0.4418, http://gatk.sourceforge.net/

Genome Viewer (IGV v.1.5.48), http://www.broadinstitute.org/

software/igv/

Online Mendelian Inheritance in Man (OMIM), http://www.

omim.org

Picard v.1.29, http://sourceforge.net/projects/picard/

PolyPhen2.1, http://genetics.bwh.harvard.edu./pph2/

SIFT, http://sift.jcvi.org/

SOLiD Baylor protocol 2.1, http://www.hgsc.bcm.tmc.edu/

documents/Preparation_of_SOLiD_Capture_Libraries.pdf

UCSC Genome Browser, http://genome.ucsc.edu/cgi-bin/

hgGateway

Table 2. Case-Control Mutation Screening Applied to the BCFR Population-Based Study

Rare XRCC2 Variant | Effect on Protein | Align-GVGD [a] | SIFT [b] | PolyPhen-2.1 (HumDiv) | Case (n = 1,308) or Control (n = 1,120) | Pedigree (BCFR) | Age and Origin of Carrier

Truncating variants

c.49C>T | p.Arg17* | – | – | – | case | Figure 1F | 33, white

Missense substitutions

c.46G>T | p.Ala16Ser | C0 | 0.24 | benign | case | – | 44, East Asian

c.181C>A | p.Leu61Ile | C0 | 0.00 | possibly damaging | case | Figure 1H | 30, East Asian

c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case [c] | Figure 1E | 32, white

c.283A>G | p.Ile95Val | C0 | 0.34 | benign | control | – | 44, white

c.693G>T | p.Trp231Cys | C65 | 0.00 | probably damaging | case [d] | Figure 1I | 44, East Asian

c.808T>G | p.Phe270Val | C45 | 0.00 | probably damaging | case | Figure 1J | 38, African

Silent substitution

c.354G>A | p.Val118Val | – | – | – | case [d] | – | 44, East Asian

5′ UTR variants

c.-1G>A | ? | – | – | – | case [e] | – | 32, white

The following abbreviation is used: BCFR, Breast Cancer Family Registry.

[a] Protein multiple sequence alignment (PMSA) used for obtaining scores for Align-GVGD: from Human to Branchiostoma floridae (Bflo).

[b] PMSA used for obtaining scores for SIFT: from Human to Trichoplax (Tadh).

[c] This carrier of p.Arg91Trp was identified through both the ABCFR multiple-case family screening and the BCFR-IARC (Breast Cancer Family Registry-International Agency for Research on Cancer) case-control screening.

[d] This 44-year-old East Asian case carries p.Trp231Cys and p.Val118Val.

[e] This case is considered a "noncarrier" in the analysis.
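The "predicted deleterious by at least two prediction algorithms" rule described in the text can be illustrated on the missense rows of Table 2. The sketch below is not the authors' pipeline. The cutoffs are assumptions made for illustration: any Align-GVGD grade above C0, a SIFT score of 0.05 or lower, and a PolyPhen call of "possibly" or "probably damaging" each count as one deleterious vote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Missense:
    protein: str
    align_gvgd: str  # grade such as "C0" or "C65"
    sift: float
    polyphen: str
    status: str      # "case" or "control"


# Missense rows transcribed from Table 2.
TABLE2_MISSENSE = [
    Missense("p.Ala16Ser", "C0", 0.24, "benign", "case"),
    Missense("p.Leu61Ile", "C0", 0.00, "possibly damaging", "case"),
    Missense("p.Arg91Trp", "C65", 0.00, "probably damaging", "case"),
    Missense("p.Ile95Val", "C0", 0.34, "benign", "control"),
    Missense("p.Trp231Cys", "C65", 0.00, "probably damaging", "case"),
    Missense("p.Phe270Val", "C45", 0.00, "probably damaging", "case"),
]


def deleterious_votes(variant: Missense) -> int:
    """Count tools calling the variant deleterious (assumed cutoffs)."""
    votes = 0
    if int(variant.align_gvgd[1:]) > 0:   # assumed: any grade above C0
        votes += 1
    if variant.sift <= 0.05:              # assumed: conventional SIFT cutoff
        votes += 1
    if variant.polyphen in {"possibly damaging", "probably damaging"}:
        votes += 1
    return votes


def probably_deleterious(variant: Missense) -> bool:
    """Apply the at-least-two-of-three rule."""
    return deleterious_votes(variant) >= 2


FLAGGED = [v for v in TABLE2_MISSENSE if probably_deleterious(v)]
FLAGGED_CASES = [v.protein for v in FLAGGED if v.status == "case"]
FLAGGED_CONTROLS = [v.protein for v in FLAGGED if v.status == "control"]

if __name__ == "__main__":
    print("cases:", FLAGGED_CASES)
    print("controls:", FLAGGED_CONTROLS)
```

Under these assumed cutoffs, four missense variants in cases and none in controls are flagged; together with the two protein-truncating variants reported in the text, this matches the six probably deleterious variants in cases.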


References

1. Turnbull, C., and Rahman, N. (2008). Genetic predisposition

to breast cancer: Past, present, and future. Annu. Rev. Geno-

mics Hum. Genet. 9, 321–345.

2. John, E.M., Hopper, J.L., Beck, J.C., Knight, J.A., Neuhausen,

S.L., Senie, R.T., Ziogas, A., Andrulis, I.L., Anton-Culver, H.,

Boyd, N., et al; Breast Cancer Family Registry. (2004). The

Breast Cancer Family Registry: An infrastructure for coopera-

tive multinational, interdisciplinary and translational studies

of the genetic epidemiology of breast cancer. Breast Cancer

Res. 6, R375–R389.

3. Giles, G.G., and R, E.D. (2002). The Melbourne Collaborative

Cohort Study. IARC Sci Publ 156, 2.

4. Cartwright, R., Tambini, C.E., Simpson, P.J., and Thacker, J.

(1998). The XRCC2 DNA repair gene from human and mouse

encodes a novel member of the recA/RAD51 family. Nucleic

Acids Res. 26, 3084–3089.

5. Deans, B., Griffin, C.S., O’Regan, P., Jasin, M., and Thacker, J.

(2003). Homologous recombination deficiency leads to

profound genetic instability in cells derived from Xrcc2-

knockout mice. Cancer Res. 63, 8181–8187.

6. Tambini, C.E., Spink, K.G., Ross, C.J., Hill, M.A., and Thacker,

J. (2010). The importance of XRCC2 in RAD51-related DNA

damage repair. DNA Repair (Amst.) 9, 517–525.

7. Moynahan, M.E., Chiu, J.W., Koller, B.H., and Jasin, M. (1999).

Brca1 controls homology-directed DNA repair. Mol. Cell 4,

511–518.

8. Moynahan, M.E., Pierce, A.J., and Jasin, M. (2001). BRCA2 is

required for homology-directed repair of chromosomal breaks.

Mol. Cell 7, 263–272.

9. Meindl, A., Hellebrand, H., Wiek, C., Erven, V., Wappensch-

midt, B., Niederacher, D., Freund, M., Lichtner, P., Hartmann,

L., Schaal, H., et al. (2010). Germline mutations in breast and

ovarian cancer pedigrees establish RAD51C as a human cancer

susceptibility gene. Nat. Genet. 42, 410–414.

10. Shamseldin, H.E., Elfaki, M., and Alkuraya, F.S. (2012). Exome

sequencing reveals a novel Fanconi group defined by XRCC2

mutation. J. Med. Genet. 49, 184–186.

11. Gao, L.-B., Pan, X.-M., Li, L.-J., Liang, W.-B., Zhu, Y., Zhang,

L.-S., Wei, Y.-G., Tang, M., and Zhang, L. (2011). RAD51

135G/C polymorphism and breast cancer risk: A meta-analysis

from 21 studies. Breast Cancer Res. Treat. 125, 827–835.

12. Loveday, C., Turnbull, C., Ramsay, E., Hughes, D., Ruark, E.,

Frankum, J.R., Bowden, G., Kalmyrzaev, B., Warren-Perry,

M., Snape, K., et al; Breast Cancer Susceptibility Collaboration

(UK). (2011). Germline mutations in RAD51D confer susceptibility to ovarian cancer. Nat. Genet. 43, 879–882.

13. Liu, N., Schild, D., Thelen, M.P., and Thompson, L.H. (2002).

Involvement of Rad51C in two distinct protein complexes

of Rad51 paralogs in human cells. Nucleic Acids Res. 30,

1009–1015.

14. Griffin, C.S., Simpson, P.J., Wilson, C.R., and Thacker, J.

(2000). Mammalian recombination-repair genes XRCC2 and

XRCC3 promote correct chromosome segregation. Nat. Cell

Biol. 2, 757–761.

15. Lin, W.-Y., Camp, N.J., Cannon-Albright, L.A., Allen-Brady, K.,

Balasubramanian, S., Reed, M.W.R., Hopper, J.L., Apicella, C.,

Giles, G.G., Southey, M.C., et al. (2011). A role for XRCC2

gene polymorphisms in breast cancer risk and survival. J.

Med. Genet. 48, 477–484.

16. Rafii, S., O’Regan, P., Xinarianos, G., Azmy, I., Stephenson, T.,

Reed, M., Meuth, M., Thacker, J., and Cox, A. (2002). A poten-

tial role for the XRCC2 R188H polymorphic site in DNA-

damage repair and breast cancer. Hum. Mol. Genet. 11,

1433–1438.

17. Le Calvez-Kelm, F., Lesueur, F., Damiola, F., Vallee, M.,

Voegele, C., Babikyan, D., Durand, G., Forey, N., McKay-

Chopin, S., Robinot, N., et al; Breast Cancer Family Registry.

(2011). Rare, evolutionarily unlikely missense substitutions

in CHEK2 contribute to breast cancer susceptibility: results

from a breast cancer family registry case-control mutation-

screening study. Breast Cancer Res. 13, R6.

18. Tavtigian, S.V., Oefner, P.J., Babikyan, D., Hartmann, A.,

Healey, S., Le Calvez-Kelm, F., Lesueur, F., Byrnes, G.B.,

Chuang, S.-C., Forey, N., et al; Australian Cancer Study; Breast

Cancer Family Registries (BCFR); Kathleen Cuningham

Foundation Consortium for Research into Familial Aspects

of Breast Cancer (kConFab). (2009). Rare, evolutionarily

unlikely missense substitutions in ATM confer increased risk

of breast cancer. Am. J. Hum. Genet. 85, 427–446.

19. Mann, G.J., Thorne, H., Balleine, R.L., Butow, P.N., Clarke,

C.L., Edkins, E., Evans, G.M., Fereday, S., Haan, E., Gattas,

M., et al; Kathleen Cuningham Consortium for Research in

Familial Breast Cancer. (2006). Analysis of cancer risk and

BRCA1 and BRCA2 mutation prevalence in the kConFab

familial breast cancer resource. Breast Cancer Res. 8, R12.

20. Ding, Y.C., Steele, L., Chu, L.-H., Kelley, K., Davis, H., John,

E.M., Tomlinson, G.E., and Neuhausen, S.L. (2011). Germline

mutations in PALB2 in African-American breast cancer cases.

Breast Cancer Res. Treat. 126, 227–230.

21. Goldgar, D.E., Healey, S., Dowty, J.G., Da Silva, L., Chen, X.,

Spurdle, A.B., Terry, M.B., Daly, M.J., Buys, S.M., Southey,

M.C., et al; BCFR; kConFab. (2011). Rare variants in the

ATM gene and risk of breast cancer. Breast Cancer Res. 13, R73.

22. Tischkowitz, M., Capanu, M., Sabbaghian, N., Li, L., Liang, X.,

Vallee, M.P., Tavtigian, S.V., Concannon, P., Foulkes, W.D.,

Bernstein, L., et al; The WECARE Study Collaborative Group.

(2012). Rare germline mutations in PALB2 and breast cancer

risk: A population-based study. Hum Mutat 33, 674–680.

23. Rahman, N., Seal, S., Thompson, D., Kelly, P., Renwick, A.,

Elliott, A., Reid, S., Spanova, K., Barfoot, R., Chagtai, T., et al;

Breast Cancer Susceptibility Collaboration (UK). (2007).

PALB2, which encodes a BRCA2-interacting protein, is a breast

cancer susceptibility gene. Nat. Genet. 39, 165–167.

24. Seal, S., Thompson, D., Renwick, A., Elliott, A., Kelly, P.,

Barfoot, R., Chagtai, T., Jayatilake, H., Ahmed, M., Spanova,

K., et al; Breast Cancer Susceptibility Collaboration (UK).

(2006). Truncating mutations in the Fanconi anemia J gene

BRIP1 are low-penetrance breast cancer susceptibility alleles.

Nat. Genet. 38, 1239–1241.

25. Turnbull, C., Seal, S., Renwick, A., Warren-Perry, M., Hughes,

D., Elliott, A., Pernet, D., Peock, S., Adlard, J.W., Barwell, J.,

et al; Breast Cancer Susceptibility Collaboration (UK),

EMBRACE. (2012). Gene-gene interactions in breast cancer

susceptibility. Hum. Mol. Genet. 21, 958–962.


REPORT

Exome Sequencing Identifies PDE4D Mutations as Another Cause of Acrodysostosis

Caroline Michot,1,10 Carine Le Goff,1,10 Alice Goldenberg,2 Avinash Abhyankar,3 Celine Klein,1

Esther Kinning,4 Anne-Marie Guerrot,2 Philippe Flahaut,5 Alice Duncombe,6 Genevieve Baujat,1

Stanislas Lyonnet,1 Caroline Thalassinos,7 Patrick Nitschke,8 Jean-Laurent Casanova,3,9

Martine Le Merrer,1 Arnold Munnich,1 and Valerie Cormier-Daire1,*

Acrodysostosis is a rare autosomal-dominant condition characterized by facial dysostosis, severe brachydactyly with cone-shaped epiph-

yses, and short stature. Moderate intellectual disability and resistance to multiple hormones might also be present. Recently, a recurrent

mutation (c.1102C>T [p.Arg368*]) in PRKAR1A has been identified in three individuals with acrodysostosis and resistance to multiple

hormones. After studying ten unrelated acrodysostosis cases, we report here de novo PRKAR1A mutations in five out of the ten individ-

uals (we found c.1102C>T [p.Arg368*] in four of the ten and c.1117T>C [p.Tyr373His] in one of the ten). We performed exome

sequencing in two of the five remaining individuals and selected phosphodiesterase 4D (PDE4D) as a candidate gene. PDE4D encodes

a class IV cyclic AMP (cAMP)-specific phosphodiesterase that regulates cAMP concentration. Exome analysis detected heterozygous

PDE4D mutations (c.673C>A [p.Pro225Thr] and c.677T>C [p.Phe226Ser]) in these two individuals. Screening of PDE4D identified

heterozygous mutations (c.568T>G [p.Ser190Ala] and c.1759A>C [p.Thr587Pro]) in two additional acrodysostosis cases. These

mutations occurred de novo in all four cases. The four individuals with PDE4D mutations shared common clinical features, namely

characteristic midface and nasal hypoplasia and moderate intellectual disability. Metabolic screening was normal in three of these

four individuals. However, resistance to parathyroid hormone and thyrotropin was consistently observed in the five cases with PRKAR1A

mutations. Finally, our study further supports the key role of the cAMP signaling pathway in skeletogenesis.

Acrodysostosis (MIM 101800) is a dominantly inherited

condition consisting of (1) skeletal dysplasia characterized

by facial dysostosis with nasal hypoplasia (a depressed

nasal bridge and prominent mandible), severe brachydac-

tyly with short broad metatarsals, metacarpals, and

phalanges, cone-shaped epiphyses, advanced bone matu-

ration, spinal stenosis, and short stature; (2) resistance to

multiple hormones, including parathyroid hormone and

thyrotropin; and (3) possible neurological involvement

(moderate to mild intellectual disability).1,2 Differential

diagnoses include Albright hereditary osteodystrophy

(MIM 103580) and pseudopseudohypoparathyroidism

(MIM 612463), which are both due to loss-of-function

mutations in GNAS (α-stimulatory subunit of the G protein)

(MIM 139320) and are characterized by less severe hand

and foot involvement.3

A recurrent c.1102C>T mutation in PRKAR1A (MIM

188830) has been recently identified in three cases of acro-

dysostosis with resistance to multiple hormones.4 This

gene encodes the cyclic AMP (cAMP)-dependent regula-

tory subunit of protein kinase A. The mutated subunit

impairs the protein-kinase-A response to cAMP and

accounts for hormone resistance and skeletal abnormali-

ties resembling those observed in Albright hereditary

osteodystrophy.

After studying ten unrelated individuals with acrodysos-

tosis, we found PRKAR1A mutations in five out of the

ten, and we show that most of the remaining cases were

accounted for by mutations in phosphodiesterase 4D

(PDE4D [MIM 600129]), which is also involved in cAMP

metabolism.

Ten unrelated cases were included in this study. There

was no family history, and each individual was the only

affected member in his family. Inclusion criteria were the

following: (1) the presence of severe generalized brachy-

dactyly affecting metacarpals and phalanges and associ-

ated with cone-shaped epiphyses and (2) the exclusion of

Albright hereditary osteodystrophy on the basis of normal

bioactivity of the Gs alpha subunit and normal GNAS

sequencing.

We performed a complete screening of phosphocalcic

metabolism and blood levels of creatinine, calcium,

phosphorus, thyroxin, thyrotropin, 25-hydroxyvitamin D,

1,25-dihydroxyvitamin D, parathyroid hormone (PTH),

and fibroblast growth factor 23, as well as urinary levels

of creatinine, calcium, and phosphorus. The clinical,

radiological, and biochemical details are summarized in

Table 1 and Figure 1.

1Unite Institut National de la Sante et de la Recherche Medicale U781, Departement de Genetique, Universite Paris Descartes, Sorbonne Paris Cite, Hopital Necker Enfants Malades, Paris 75015, France; 2Service de Genetique Medicale, Centre Hospitalier Universitaire-Hopitaux de Rouen, Rouen 76100, France; 3St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY 10065, USA; 4The Ferguson-Smith Centre for Clinical Genetics, Royal Hospital for Sick Children-Yorkhill, Dalnair Street, Glasgow G3 8SJ, Scotland; 5Service de Pediatrie, Centre Hospitalier Universitaire-Hopitaux de Rouen, Rouen 76100, France; 6Service d’Ophtalmologie, Centre Hospitalier Universitaire-Hopitaux de Rouen, Rouen 76100, France; 7Endocrinologie Gynecologie Diabetologie Pediatrique, Assistance Publique-Hopitaux de Paris, Paris 75015, France; 8Plateforme de Bioinformatique, Universite Paris Descartes, Paris 75015, France; 9Unite Institut National de la Sante et de la Recherche Medicale U980, Laboratory of Human Genetics of Infectious Diseases, Necker Medical School, University Paris Descartes, Paris 75015, France

10These authors contributed equally to this work

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.03.003. ©2012 by The American Society of Human Genetics. All rights reserved.

740 The American Journal of Human Genetics 90, 740–745, April 6, 2012

Informed consent for participation, sample collection,

and photograph publication was obtained via protocols

approved by the Necker Hospital ethics committee.

We sequenced PRKAR1A (RefSeq accession number

NM_002734.3) by using specific primers (available upon

request) in the ten individuals. De novo PRKAR1A muta-

tions, including the recurrent mutation,4 were identified

in five out of the ten individuals (c.1102C>T [p.Arg368*]

was found in four of the five, and c.1117T>C [p.Tyr373His]

was found in one of the five) (Table 1). This missense

mutation was predicted to be damaging by PolyPhen, was

found to alter a conserved amino acid located in the cata-

lytic domain, and was not identified in alleles from 200

ethnically matched controls.

The exclusion of PRKAR1A in five acrodysostosis cases

prompted us to perform exome sequencing in two of these

five individuals. Exome capture was performed with the

SureSelect Human All Exon kit (Agilent Technologies).5

Single-end sequencing was performed on an Illumina

Genome Analyzer IIx (Illumina) and generated 72 bp

reads. For sequence alignment, variant calling, and anno-

tation, we aligned the sequences to the human genome

reference sequence (hg18 build) by using the Burrows-

Wheeler Aligner.6 Downstream processing was carried out

with the Genome Analysis Toolkit (GATK7), SAMtools,8

and Picard tools. Substitution calls were made with

GATK Unified Genotyper, whereas indel calls were made

with a GATK IndelGenotyperV2. All calls with a read

coverage ≤2× and a Phred-scaled SNP quality of ≤20

were filtered out. All the variants were annotated with an

in-house-developed annotation software system. We first

focused our analyses on nonsynonymous variants, splice-

acceptor and donor-site mutations, and coding indels

because we anticipated that synonymous variants would

be far less likely to cause disease (Table S1, available

online). We also defined variants as previously unidenti-

fied if they were absent from control populations and

from all datasets, including dbSNP129, the 1000 Genomes

Project, and in-house exome data.
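
The filtering steps described above can be sketched in a few lines of Python. This is only an illustration and not the authors' in-house annotation software: the `Variant` record, its field names, the consequence labels, and the example calls are all hypothetical. The sketch also assumes that a call is discarded when either its read coverage is ≤2× or its Phred-scaled quality is ≤20, which is one possible reading of the thresholds given in the text.

```python
from dataclasses import dataclass

# Consequence classes retained in the first pass (labels are illustrative).
KEPT_CONSEQUENCES = {
    "nonsynonymous", "splice_acceptor", "splice_donor", "coding_indel",
}


@dataclass(frozen=True)
class Variant:
    gene: str
    depth: int                          # read coverage at the site
    quality: float                      # Phred-scaled call quality
    consequence: str                    # e.g. "nonsynonymous", "synonymous"
    known_in: frozenset = frozenset()   # reference sets already listing it


def passes_quality(v, min_depth=2, min_quality=20.0):
    # Assumption: drop the call if EITHER value is at or below its threshold.
    return v.depth > min_depth and v.quality > min_quality


def is_candidate(v):
    # Keep well-supported, protein-altering variants that are absent from
    # every reference set (dbSNP, 1000 Genomes, in-house exomes).
    return (
        passes_quality(v)
        and v.consequence in KEPT_CONSEQUENCES
        and not v.known_in
    )


def candidate_variants(variants):
    return [v for v in variants if is_candidate(v)]


# Hypothetical calls, for illustration only.
example_calls = [
    Variant("PDE4D", depth=35, quality=88.0, consequence="nonsynonymous"),
    Variant("GENE_A", depth=2, quality=90.0, consequence="nonsynonymous"),
    Variant("GENE_B", depth=40, quality=75.0, consequence="synonymous"),
    Variant("GENE_C", depth=50, quality=99.0, consequence="nonsynonymous",
            known_in=frozenset({"dbSNP129"})),
]

print([v.gene for v in candidate_variants(example_calls)])  # ['PDE4D']
```

Keeping the quality, functional-class, and novelty checks in separate predicates makes it easy to relax one criterion (for example, to revisit synonymous variants) without touching the others.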

On the basis of the dominant mode of inheritance of

acrodysostosis, we selected eight candidate genes that all

harbor heterozygous mutations (Table S2). Given the

involvement of PRKAR1A, a cAMP-activated protein kinase

A, in some acrodysostosis cases,4 we then only considered

gene(s) that encode proteins involved in the cAMP

signaling pathway. Therefore, we regarded PDE4D (RefSeq

accession number NM_001104631) as the best candidate

gene. Indeed, PDE4D encodes a class IV cAMP-specific

phosphodiesterase that regulates cAMP concentration. Exome

analysis detected two PDE4D mutations (c.673C>A

[p.Pro225Thr] and c.677T>C [p.Phe226Ser]) in the two

individuals. These results were confirmed by Sanger

sequencing. Subsequent screening of the 15 PDE4D coding

exons in the three remaining cases led to the identifica-

tion of two distinct heterozygous missense mutations

(c.568T>G [p.Ser190Ala] and c.1759A>C [p.Thr587Pro])

in two additional cases. These mutations were not observed

in the parents of acrodysostosis-affected individuals, con-

firming that they occurred de novo.

We identified a total of four distinct heterozygous

PDE4D mutations in four individuals (Table 1). Among

them, the p.Ser190Ala substitution affected a serine

residue predicted to be phosphorylated (Uniprot database),

and the p.Thr587Pro substitution disturbed the conserved

catalytic PDEase_I domain (pfam database), which confers

the 3′,5′-cyclic nucleotide phosphodiesterase activity. The

two remaining alterations (p.Pro225Thr and p.Phe226Ser)

affected conserved residues across species. All four muta-

tions were considered to be pathogenic by PolyPhen

and were absent from alleles in 200 ethnically matched

controls.

Here, we report PDE4D mutations in four unrelated

cases of acrodysostosis and PRKAR1A mutations in five

cases. All mutations occurred de novo, providing further

evidence that acrodysostosis has a dominant mode of

inheritance.

After we divided up the acrodysostosis-affected individ-

uals and grouped them according to the mutations they

had, our study revealed interesting genotype-phenotype

correlations. Indeed, the four individuals with PDE4D

mutations shared characteristic facial features, namely

midface hypoplasia with the canonical nasal hypoplasia

initially reported in acrodysostosis and moderate intellec-

tual disability with speech delay.1,2 The characteristic

facial dysostosis and intellectual disability were neither

observed in our individuals with PRKAR1A mutations nor

mentioned in the three previously reported cases.4 Along

the same lines, hormone resistance was observed in only

one person with PDE4D mutations—case 6 had increased

PTH levels and normal serum-phosphate levels—whereas

hormone resistance was consistently observed in individ-

uals carrying PRKAR1A mutations (all five suffered from

chronic resistance to parathyroid hormone, and four of

the five had peripheral hypothyroidism). Although our

study does not allow generalized conclusions, our find-

ings might suggest that individuals with facial dysostosis

and moderate intellectual disability should be screened

for PDE4D mutations, whereas individuals with less char-

acteristic facial features, no intellectual disability, and

hormone resistance should be screened for mutations in

PRKAR1A.

The five individuals harboring PRKAR1A mutations presented

with growth retardation (<−2 standard deviations

[SDs]) and decreased growth speed in late childhood

(between 7 and 13 years of age). The adult individuals

had a final height <−3 SDs. In contrast, the cases

harboring PDE4D mutations presently have normal

growth charts, but they are only 3–7 years old, and predict-

ing final adult height is therefore impossible. It is worth

noting that two out of the four PDE4D cases presented

with an acute intracranial hypertension due to sinus

thrombosis; both of these individuals required derivation

surgery and medical treatment. This observation should

prompt the careful investigation of headache complaints

in such cases. This feature, hitherto unreported in

PRKAR1A-mutation-positive individuals, might be another

distinctive characteristic specific to the clinical spectrum of

symptoms associated with PDE4D mutations. Finally,

neither PDE4D nor PRKAR1A mutations were found in

one adult individual who had characteristic skeletal

features but no hormone resistance or facial dysostosis.

One cannot exclude a molecular defect not detectable by

Sanger sequencing, but it is also conceivable that other

genes might account for acrodysostosis.

Table 1. Clinical, Radiological, and Biochemical Data of the Ten Acrodysostosis Individuals Reported

| | Patient 1 | Patient 2 | Patient 3 | Patient 4 | Patient 5 | Patient 6 | Patient 7 | Patient 8 | Patient 9 | Patient 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sex | female | male | female | female | female | male | male | male | male | female |
| PRKAR1A mutation | c.1102C>T | c.1102C>T | c.1102C>T | c.1117T>C | c.1102C>T | − | − | − | − | − |
| PDE4D mutation | − | − | − | − | − | c.673C>A | c.677T>C | c.568T>G | c.1759A>C | − |
| IUGR | no | no | no | yes | no | yes | no | no | no | no |
| Postnatal growth retardation (<−2 SDs) | yes (26 years old) | no (8 years old) | yes (13 years old) | yes (22 years old) | yes (34 years old) | no (7 years old) | no (4 years old) | no (4 years old) | no (3 years old) | yes (38 years old) |
| Advanced bone age | − | yes | yes | − | − | yes | yes | yes | yes | − |
| Facial dysostosis | | | | | | | | | | |
| Nasal hypoplasia | no | no | no | no | no | yes | yes | yes | yes | no |
| Depressed nasal bridge | no | no | yes | no | yes | yes | yes | yes | yes | no |
| Prominent mandible | no | no | no | no | yes | no | no | yes | no | yes |
| Peripheral dysostosis | | | | | | | | | | |
| Severe brachydactyly | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| Short metatarsals, metacarpals, and phalanges | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| Cone-shaped epiphyses | yes (childhood) | yes | yes | yes (childhood) | yes (childhood) | yes | yes | yes | yes | nd |
| Hormonal screening | | | | | | | | | | |
| PTH (n = 10–46 ng/l) | 95 | 79 | 116 | 84 | 142 | 76 | 39 | 24 | 19 | normal |
| Calcemia (n = 2.2–2.7 mmol/l) | 2.35 | 2.47 | 2.4 | 2.57 | 2.37 | 2.18 | 2.47 | 2.4 | 2.5 | 2.38 |
| Phosphoremia (n = 1.3–1.85 mmol/l) | 1.23 | 1.56 | 1.76 | nd | 1.3 | 1.54 | 1.68 | 1.81 | 1.7 | 1.8 |
| 25-OHvitD (n = 30–80 ng/ml) | nd | 26 | 16 | nd | 22 | 26 | 25 | 45 | 30 | nd |
| FGF23 (n = 1–120 UI/ml) | nd | nd | nd | nd | 145 | 90 | 112 | 171.2 | 60.9 | nd |
| Free T4 (n = 7.5–15 pmol/l) | 8.65 | hypothyroidism | hypothyroidism | hypothyroidism | 16.71 treatment | 10.4 | 10 | 14.3 | 17 | 17 |
| TSH (n = 0.34–5.6 mUI/l) | 2.67 | 13.41 | 15.42 | increased | 0.16 | 2.59 | 2.51 | 3.58 | 2.77 | 1.81 |
| Calciuria (n = 1.5–6 mmol/l) | nd | <0.2 | nd | nd | nd | 0.86 | 1.28 | 1.44 | 2.27 | nd |

1,25-diOHvitD (pg/ml): nd, 39, 106, 65, 53, nd, nd, nd (only eight values survive for this row, so their assignment to patients is not recoverable).

Considering PRKAR1A mutations, we confirm that

c.1102C>T is a recurrent mutation observed in seven of

the eight patients reported so far, whereas only one

missense mutation that changes a conserved amino acid

located in the cAMP binding domain has been identified.

Interestingly, the p.Arg368* substitution is considered

a gain-of-function mutation because it decreases protein-

kinase-A sensitivity to cAMP.4 In contrast, germ-line loss-

of-function mutations resulting in constitutive activation

of protein kinase A are responsible for Carney complex

(MIM 160980), an autosomal-dominant multiple-

neoplasia syndrome characterized by cardiac, endocrine,

cutaneous, and neural myxomatous tumors and pig-

mented lesions of the skin and mucosae.9

All mutations that we have identified in PDE4D are

heterozygous missense mutations and are presumably

responsible for impaired phosphodiesterase activity.

PDE4D belongs to the cAMP-hydrolyzing phosphodies-

terase family, which is directly involved in the rate of

cAMP degradation. Considering the crucial role of cAMP

in intracellular signaling in response to a number of

membrane-impermeable hormones, a dysregulation of

cAMP levels could be the underlying mechanism of the

acrodysostosis that results from PDE4D mutations.

The cAMP-specific PDE4 family is widely expressed, and

PDE4 isoforms have similar catalytic functions, but they

have distinct cellular functions because of differences in

specific intracellular trafficking and signaling-complex

formation.10,11 PDE4D uses different promoters to

generate multiple alternatively spliced transcript variants

(at least nine) that encode functional proteins; this might

explain the phenotype variability observed in the four re-

ported cases.

Of note, mice deficient in PDE4D exhibited delayed

growth and female infertility due to impaired ovula-

tion;12 these two symptoms have also been described in

acrodysostosis cases.13 Mouse models have also revealed

that PDE4D plays a critical role in the memory and hippo-

campal neurogenesis mediated by cAMP signaling.14

Flies deficient in the PDE4D homolog, dunce, also display

impaired central-nervous-system and reproductive func-

tion.15 All together, these data support the involvement

of PDE4D impairment in the regulation of cAMP signal-

ing, especially in growth and central-nervous-system

development.

Table 1. Continued

| | Patient 1 | Patient 2 | Patient 3 | Patient 4 | Patient 5 | Patient 6 | Patient 7 | Patient 8 | Patient 9 | Patient 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Phosphaturia (n = 10–50 mmol/l) | nd | 14.8 | nd | nd | nd | 13.76 | 7.12 | 36.7 | 4.70 | nd |
| Creatininuria (mmol/l) | nd | 4.5 | nd | nd | nd | 6.3 | 2.08 | 11.7 | 4.15 | nd |
| Neurology | | | | | | | | | | |
| Intellectual disability | no | no | no | no | no | yes; speech delay requiring orthophony; fine-motor-skill impairment | yes; speech delay requiring orthophony; fine-motor-skill impairment | yes; speech delay; psychomotor delay (walked at 17 months of age) | yes; speech delay; psychomotor delay requiring physiotherapy | no |

Other (four entries survive for this row; their assignment to patients is not recoverable): (i) spinal stenosis; carpal tunnel; (ii) intracranial hypertension with jugular stenosis requiring derivation; (iii) intracranial hypertension with thrombophlebitis of the transverse sinus and jugular (treated by acetazolamide, anticoagulant, and derivation); (iv) spinal stenosis.

The following abbreviations are used: IUGR, intrauterine growth retardation; SDs, standard deviations; nd, not done; PTH, parathyroid hormone; 25-OHvitD, 25-hydroxyvitamin D; 1,25-diOHvitD, 1,25-dihydroxyvitamin D; FGF23, fibroblast growth factor 23; T4, thyroxin; and TSH, thyrotropin.


Finally, our findings further support the key role of the

cAMP signaling pathway in skeletogenesis, as previously

shown for Albright hereditary osteodystrophy due to

GNAS mutations. Ongoing studies will highlight the

specific link between PRKAR1A and PDE4D, which are

both involved in cAMP signaling and responsible for acro-

dysostosis.

Supplemental Data

Supplemental Data include two tables and can be found with this

article online at http://www.cell.com/AJHG.

Acknowledgments

We thank all individuals and their families for their contribution

to this work.

Received: December 9, 2011

Revised: January 11, 2012

Accepted: March 6, 2012

Published online: March 29, 2012

Web Resources

The URLs for data presented herein are as follows:

Online Mendelian Inheritance in Man (OMIM), http://www.

omim.org

Pfam, http://www.sanger.ac.uk/resources/databases/pfam.html

Picard Tools, http://picard.sourceforge.net

Polyphen, http://genetics.bwh.harvard.edu/pph/

Uniprot, http://www.uniprot.org/

References

1. Maroteaux, P., and Malamut, G. (1968). Acrodysostosis. Presse

Med. 76, 2189–2192.

2. Robinow, M., Pfeiffer, R.A., Gorlin, R.J., McKusick, V.A.,

Renuart, A.W., Johnson, G.F., and Summitt, R.L. (1971).

Acrodysostosis. A syndrome of peripheral dysostosis, nasal

hypoplasia, and mental retardation. Am. J. Dis. Child. 121,

195–203.

3. Bastepe, M., and Juppner, H. (2005). GNAS locus and pseudo-

hypoparathyroidism. Horm. Res. 63, 65–74.

4. Linglart, A., Menguy, C., Couvineau, A., Auzan, C., Gunes,

Y., Cancel, M., Motte, E., Pinto, G., Chanson, P., Bougneres,

P., et al. (2011). Recurrent PRKAR1A mutation in acrodysos-

tosis with hormone resistance. N. Engl. J. Med. 364, 2218–

2226.

5. Byun, M., Abhyankar, A., Lelarge, V., Plancoulaine, S., Palan-

duz, A., Telhan, L., Boisson, B., Picard, C., Dewell, S., Zhao,

C., et al. (2010). Whole-exome sequencing-based discovery

of STIM1 deficiency in a child with fatal classic Kaposi

sarcoma. J. Exp. Med. 207, 2307–2312.

6. Li, H., and Durbin, R. (2009). Fast and accurate short read

alignment with Burrows-Wheeler transform. Bioinformatics

25, 1754–1760.

7. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis,

K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly,

M., and DePristo, M.A. (2010). The Genome Analysis Toolkit:

a MapReduce framework for analyzing next-generation DNA

sequencing data. Genome Res. 20, 1297–1303.

8. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer,

N., Marth, G., Abecasis, G., and Durbin, R.; 1000 Genome

Project Data Processing Subgroup. (2009). The Sequence

Alignment/Map format and SAMtools. Bioinformatics 25,

2078–2079.

9. Kirschner, L.S., Carney, J.A., Pack, S.D., Taymans, S.E., Giatza-

kis, C., Cho, Y.S., Cho-Chung, Y.S., and Stratakis, C.A. (2000).

Mutations of the gene encoding the protein kinase A type I-

alpha regulatory subunit in patients with the Carney complex.

Nat. Genet. 26, 89–92.

10. Rall, T.W., and Sutherland, E.W. (1958). Formation of a cyclic

adenine ribonucleotide by tissue particles. J. Biol. Chem. 232,

1065–1076.

Figure 1. Pictures and X-Rays of Individuals 6 and 8 with PDE4D Mutations
(A1 and B1) Full-face pictures of individuals 6 (A) and 8 (B) showing facial dysostosis with a flat nasal bridge and nasal hypoplasia.
(A2 and B2) Profile pictures show malar hypoplasia.
(A3) Palmar face of right hand.
(A4 and B3) Dorsal face of hands, which are broad and shortened.
(A5 and B4) Standard X-rays of both hands show severe brachydactyly with short, broad metacarpals and phalanges, cone-shaped epiphyses (arrows), and advanced carpal maturation.


11. Sutherland, E.W., and Rall, T.W. (1958). Fractionation and

characterization of a cyclic adenine ribonucleotide formed

by tissue particles. J. Biol. Chem. 232, 1077–1091.

12. Jin, S.L., Richard, F.J., Kuo, W.P., D’Ercole, A.J., and Conti, M.

(1999). Impaired growth and fertility of cAMP-specific phos-

phodiesterase PDE4D-deficient mice. Proc. Natl. Acad. Sci.

USA 96, 11998–12003.

13. Graham, J.M., Jr., Krakow, D., Tolo, V.T., Smith, A.K., and

Lachman, R.S. (2001). Radiographic findings and Gs-alpha

bioactivity studies and mutation screening in acrodysostosis

indicate a different etiology from pseudohypoparathyroidism.

Pediatr. Radiol. 31, 2–9.

14. Li, Y.F., Cheng, Y.F., Huang, Y., Conti, M., Wilson, S.P.,

O’Donnell, J.M., and Zhang, H.T. (2011). Phosphodiesterase-

4D knock-out and RNA interference-mediated knock-down

enhance memory and increase hippocampal neurogenesis

via increased cAMP signaling. J. Neurosci. 31, 172–183.

15. Dudai, Y., Jan, Y.N., Byers, D., Quinn, W.G., and Benzer, S.

(1976). dunce, a mutant of Drosophila deficient in learning.

Proc. Natl. Acad. Sci. USA 73, 1684–1688.


REPORT

Exome Sequencing Identifies PDE4D Mutations in Acrodysostosis

Hane Lee,1,2 John M. Graham, Jr.,3,9 David L. Rimoin,1,3,4,9 Ralph S. Lachman,5,9 Pavel Krejci,9,10

Stuart W. Tompson,8 Stanley F. Nelson,1,2 Deborah Krakow,1,6,7 and Daniel H. Cohn7,8,*

Acrodysostosis is a dominantly-inherited, multisystem disorder characterized by skeletal, endocrine, and neurological abnormalities. To

identify the molecular basis of acrodysostosis, we performed exome sequencing on five genetically independent cases. Three different

missense mutations in PDE4D, which encodes cyclic AMP (cAMP)-specific phosphodiesterase 4D, were found to be heterozygous in

three of the cases. Two of the mutations were demonstrated to have occurred de novo, providing strong genetic evidence of causation.

Two additional cases were heterozygous for de novo missense mutations in PRKAR1A, which encodes the cAMP-dependent regulatory

subunit of protein kinase A and which has been recently reported to be the cause of a form of acrodysostosis resistant to multiple

hormones. These findings demonstrate that acrodysostosis is genetically heterogeneous and underscore the exquisite sensitivity of

many tissues to alterations in cAMP homeostasis.

Acrodysostosis (MIM 101800), also known as Arkless-

Graham syndrome or Maroteaux-Malamut syndrome, is

a pleiotropic disorder characterized by skeletal, endocrine,

and neurological abnormalities.1,2 Skeletal features include

brachycephaly, midface hypoplasia with a small upturned

nose, brachydactyly, and lumbar spinal stenosis. Endo-

crine abnormalities have been reported and include hypo-

thyroidism and hypogonadism in males and irregular

menses in females (summarized in Butler et al.). Develop-

mental disability is a common finding but is variable in

severity and can be associated with significant behavioral

problems. Most cases are sporadic, and there is evidence

of a paternal age effect,3 suggesting that the phenotype

might result from de novo point mutations. Perhaps as

a result of the developmental disability and/or endocrine

abnormalities, there are only a few examples of dominant

transmission.4,5 Recently, dominant mutations in

PRKAR1A (MIM 188830), which encodes the cyclic AMP

(cAMP)-dependent regulatory subunit of protein kinase A

(PKA), were found in a subset of acrodysostosis cases

resistant to multiple hormones.6 The mutations resulted

in reduced PKA activation by cAMP and led to a reduced

hormone response in multiple tissues.

We studied five sporadic cases, four males and one

female, who had clinical and radiographic phenotypes

(Figure 1, Table 1) consistent with the diagnosis of acrody-

sostosis. A prior publication7 contains additional clinical

details on three of the five cases (International Skeletal

Dysplasia Registry reference numbers R99-101A [case 1 in

Graham et al.7], R99-514A [case 2], and 95-141A [case

R1]). All studies were carried out with informed consent

under a protocol approved by the institutional review

board at Cedars-Sinai Medical Center. To determine the

molecular basis of the phenotype in each case, we per-

formed exome capture and sequence analysis. In three

cases (R02-309A, R06-434A, and R95-141A), DNA from

the unaffected parents was also available at the outset

of the study, and we determined the exome sequences for

the six parents to facilitate the identification of de novo

mutations in these trios. High-molecular-weight genomic

DNA was extracted from either blood or lymphoblastoid

cell lines; the quality of the DNA samples was determined

by a Qubit Fluorometer (Invitrogen) and a Bioanalyzer

(Agilent). For each sample, we prepared the sequencing

library with 3 μg of genomic DNA and used the Agilent

SureSelect Target Enrichment System to construct an

Illumina Paired-End Sequencing library (protocol version

2.0.1). The Agilent SureSelect Human All Exon 50Mb kit

was used for the exome capture. Sequences for each sample

(50 bp paired end) were determined on a single lane of an

Illumina HiSeq2000 instrument, and a total of 82–123

million paired-end reads per sample were generated. We

performed base-calling by using the real-time analysis

(RTA) software provided by Illumina.

We aligned the sequence reads to human reference

genome human_g1k_v37.fasta (downloaded from the

Genome Analysis Toolkit [GATK] resource bundle in

November 2010) by using Novoalign from the Novocraft

Short Read Alignment Package; the adaptor-stripping

and base-quality-calibration options were on. We used

SAMtools version 0.1.15 to sort the aligned BAM files,

and we removed potential PCR duplicates (rmdup) by

using Picard. On average, 88.2% of the reads were uniquely

aligned to the reference genome. The PCR duplication rate

varied between 5.1% and 8.2%, and there was an average

estimated library size of 704 million unique fragments.

The on-target rate, or capture specificity, varied from 60%

to 63.8%. The mean coverage across the captured regions

was 97×, and approximately 92% of the targeted bases

were covered by ≥10 reads for each exome.

1Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA; 2Department of Pathology, University of California, Los Angeles, Los Angeles, CA 90095, USA; 3Department of Pediatrics, University of California, Los Angeles, Los Angeles, CA 90095, USA; 4Department of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA; 5Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, CA 90095, USA; 6Department of Obstetrics and Gynecology, University of California, Los Angeles, Los Angeles, CA 90095, USA; 7Department of Orthopedic Surgery, University of California, Los Angeles, Los Angeles, CA 90095, USA; 8Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, Los Angeles, CA 90095, USA; 9Medical Genetics Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA; 10Institute of Experimental Biology, Masaryk University and Department of Cytokinetics, Institute of Biophysics AS CR, 61265 Brno, Czech Republic

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.03.004. ©2012 by The American Society of Human Genetics. All rights reserved.

746 The American Journal of Human Genetics 90, 746–751, April 6, 2012

We performed local realignment for each sample by

using the GATK ‘‘IndelRealigner’’ tool, and we recalibrated

base qualities by using the GATK ‘‘TableRecalibration’’ tool

according to GATK’s recommendation (Best Practice

Variant Detection with the GATK version 2). Variants

were simultaneously called with the GATK ‘‘Unified

Genotyper’’ tool for all 11 samples (the five cases and six

unaffected parents). Small indels were called with the

‘‘-glm DINDEL’’ option. The dbSNP132 file downloaded

from the GATK resource bundle was used so that the

known SNP positions were annotated in the output

VCF (variant call format) file. Only the variants found

within the protein coding regions of the captured exons

were reported with the –L option. The interval file that

we used is available upon request. Using the GATK

‘‘VariantFiltrationWalker’’ tool, we hard filtered both the

SNPs and INDELs to remove low-quality variants. As sug-

gested by GATK, we used the following parameters for

standard filtration: (1) the clusterWindowSize was 10, (2)

mapping quality of zero was >40, (3) quality by depth

was <5.0, and (4) strand bias was >−0.10.
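
The hard-filtering criteria listed above can be expressed as simple predicates. The sketch below is not the GATK implementation and does not call GATK; it only mirrors the four stated thresholds in plain Python. The labels `MQ0`, `QD`, and `SB` follow the usual GATK annotation names, whereas the function names, the example values, and the cluster size of three SNPs per window are assumptions made for illustration (the text states only the window size of 10).

```python
def hard_filter_reasons(mq0, qd, sb, max_mq0=40, min_qd=5.0, max_sb=-0.10):
    """Return the names of the hard filters a call fails (empty list = PASS)."""
    reasons = []
    if mq0 > max_mq0:   # too many reads with mapping quality zero
        reasons.append("MQ0")
    if qd < min_qd:     # low variant quality relative to depth
        reasons.append("QD")
    if sb > max_sb:     # strand-bias score above the cutoff
        reasons.append("SB")
    return reasons


def clustered_positions(positions, window=10, cluster_size=3):
    """Flag positions belonging to a run of `cluster_size` calls that span
    fewer than `window` bases (cluster_size=3 is an assumption)."""
    pos = sorted(positions)
    flagged = set()
    for i in range(len(pos) - cluster_size + 1):
        if pos[i + cluster_size - 1] - pos[i] < window:
            flagged.update(pos[i:i + cluster_size])
    return flagged


# Illustrative values only.
print(hard_filter_reasons(mq0=0, qd=12.3, sb=-25.0))       # []
print(hard_filter_reasons(mq0=55, qd=3.1, sb=0.4))         # ['MQ0', 'QD', 'SB']
print(sorted(clustered_positions([100, 104, 108, 250])))   # [100, 104, 108]
```

Returning the list of failed filters, rather than a single pass/fail flag, corresponds to the way a VCF FILTER column records which criteria removed a variant.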

We annotated the ‘‘PASS’’-ed variants that were not

found at dbSNP132 positions by using SeattleSeqAnnota-

tion version 6.16 (SNPs and INDELs were annotated sepa-

rately). Both NCBI (National Center for Biotechnology

Information) full genes and CCDS (consensus coding

sequence) 2010 gene models were used for the annotation.

Variants present in the 1,000 Genomes Database (March

2010 release) or dbSNP131 as well as those resulting in

synonymous coding changes or found outside the coding

region were removed from further analysis.

The annotated variants were first examined in the trios

and were further filtered under a rare dominant model.

Because acrodysostosis is dominantly inherited and was

sporadic in the cases studied, we prioritized the variants

to examine the de novo variants. We identified potential

de novo variants by selecting the heterozygous variants

found only in the case but not in the parents, and we

manually inspected the raw reads of these variants to verify

that each was absent from the parental sequences.

Figure 1. Radiographic Phenotype in Acrodysostosis Cases
For four of the cases, anteroposterior hand (A–D), lateral-skull (E–H), and lumbar-spine (I–L) radiographs are shown. Individuals R02-309 and R99-101 have mutations in PRKAR1A, and individuals R06-434 and R95-141 have mutations in PDE4D. Arrows on the lateral-skull films identify midface hypoplasia. Arrows on the lumbar-spine films indicate absence of normal interpedicular widening in the lumbar vertebrae; the absence of such widening predisposes the affected individuals to spinal stenosis. The case numbers are indicated across the top.
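
The trio-based selection of de novo candidates described above can be illustrated with a short sketch. It is a simplified stand-in for the analysis and not the authors' code: the `SiteCall` record, the function name, the read counts, and the minimum parental depth of 10 reads are hypothetical. The check on raw non-reference reads in the parents imitates the manual inspection step in a crude way.

```python
from typing import NamedTuple, Tuple


class SiteCall(NamedTuple):
    genotype: Tuple[str, str]   # called alleles, e.g. ("G", "C")
    alt_reads: int              # raw reads carrying a non-reference allele
    depth: int                  # total reads covering the site


def is_candidate_de_novo(ref, child, mother, father, min_parent_depth=10):
    """True if the child is heterozygous for a non-reference allele and both
    parents are confidently homozygous for the reference allele."""
    child_is_het = len(set(child.genotype)) == 2 and ref in child.genotype
    if not child_is_het:
        return False
    for parent in (mother, father):
        if parent.genotype != (ref, ref):
            return False    # variant was called in a parent
        if parent.depth < min_parent_depth:
            return False    # too little coverage to call the parent reliably
        if parent.alt_reads > 0:
            return False    # non-reference reads present although not called
    return True


# Genotype taken from the G/C call in PDE4D; read counts are made up.
child = SiteCall(("G", "C"), alt_reads=41, depth=90)
clean_parent = SiteCall(("G", "G"), alt_reads=0, depth=85)
noisy_parent = SiteCall(("G", "G"), alt_reads=3, depth=80)
shallow_parent = SiteCall(("G", "G"), alt_reads=0, depth=4)

print(is_candidate_de_novo("G", child, clean_parent, clean_parent))    # True
print(is_candidate_de_novo("G", child, clean_parent, noisy_parent))    # False
print(is_candidate_de_novo("G", child, shallow_parent, clean_parent))  # False
```

Rejecting a site when a parent has low coverage or stray non-reference reads is deliberately conservative: it trades some sensitivity for fewer false de novo calls, which are then confirmed by Sanger sequencing in any case.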

Individual R06-434A had two de novo variants, and both

were of good quality (Table 2). Individual R95-141A

also had two potential de novo variants, but one variant

was found in a poor coverage region, and there was insuf-

ficient coverage in the parental samples for this variant

to be reliably called. Two de novo variants (c.682C>G

[p.Gln228Glu] in R06-434A and c.1769A>C [p.Glu590Ala]

in R95-141A) found in these first two individuals were

located in the same gene, PDE4D (RefSeq accession

number NM_001104631.1; MIM 600129), and an additional

PDE4D variant (c.2018G>A [p.Gly673Asp]) was

identified in a third individual, R99-514. All PDE4D vari-

ants were confirmed by Sanger-sequence analysis of PCR-

amplified fragments, and the unaffected parents were

shown to not carry the changes identified in their

offspring. In the third individual (R99-514), the PDE4D

variant was not found in DNA from the mother, and the

father could not be studied because he is deceased. These

data provide strong genetic evidence that the PDE4D

mutations are causative.

Individual R02-309A had three potential de novo vari-

ants. However, one variant showed evidence that the

same nonreference allele was present in one of the parents

even though it was not called as a variant, leaving two

potential de novo variants in this individual (Table 2).

Both variants were confirmed by Sanger-sequence analysis

of PCR-amplified fragments containing the changes.

One of the de novo variants (c.1004G>C [p.Arg335Pro])

was located in PRKAR1A (RefSeq accession number

NM_002734.3), the gene previously associated with

Table 1. Clinical Findings in the Five Cases of Acrodysostosis

| Finding | R06-434A | R95-141A | R99-514 | R02-309A | R99-101A |
|---|---|---|---|---|---|
| Sex | female | male | male | male | male |
| Locus | PDE4D | PDE4D | PDE4D | PRKAR1A | PRKAR1A |
| Skeletal abnormalities | | | | | |
| Short stature | no | mild | mild | mild | mild |
| Small hands | yes | yes | yes | yes | yes |
| Midface hypoplasia | yes | yes | yes | yes | yes |
| Lumbar stenosis | unknown | yes | yes | yes | yes |
| Neurological abnormality | | | | | |
| Developmental disability | no | significant | mild | mild | mild |
| Endocrine abnormalities | | | | | |
| Hypothyroidism | no | no | congenital | no | congenital |
| Hypogonadism | unknown | cryptorchidism | no | no | unilateral undescended testis |
| Hearing loss | no | no | no | no | moderate mixed |

Table 2. De Novo Variants Identified by Exome Sequencing in the Five Cases of Acrodysostosis

| Individual | Chromosome | Genomic Position | Reference Sequence | Variant Sequence | Locus | cDNA Position | Protein Change | De Novo? | PolyPhen-2 Prediction | SIFT Prediction |
|---|---|---|---|---|---|---|---|---|---|---|
| R06-434A | 5 | 58,489,328 | G/G | G/C | PDE4D | c.682C>G | p.Gln228Glu | yes | probably damaging | damaging |
| R06-434A | 7 | 148,963,588 | C/C | C/T | ZNF783 | c.187C>T | p.Arg63Cys | yes | - | damaging |
| R95-141A | 5 | 58,272,238 | T/T | T/G | PDE4D | c.1769A>C | p.Glu590Ala | yes | probably damaging | damaging |
| R99-514 | 5 | 58,270,903 | C/C | C/T | PDE4D | c.2018G>A | p.Gly673Asp | not in mother | probably damaging | damaging |
| R02-309A | 17 | 66,526,448 | G/G | G/C | PRKAR1A | c.1004G>C | p.Arg335Pro | yes | probably damaging | damaging |
| R02-309A | 2 | 175,264,813 | T/T | T/C | SCRN3 | c.302T>C | p.Leu108Ser | yes | probably damaging | tolerated |
| R99-101 | 17 | 66,526,424 | T/T | T/C | PRKAR1A | c.980T>C | p.Ile327Thr | yes | probably damaging | damaging |


acrodysostosis with hormone resistance.6 Individual R99-

101 was also found to have a variant (c.980T>C

[p.Ile327Thr]) in PRKAR1A, and subsequent Sanger-

sequence analysis of a PCR-amplified fragment confirmed

the mutation and demonstrated its absence from DNA

derived from the parents; this analysis indicated that the

variant resulted from a de novo event. Therefore, acrody-

sostosis in these latter two individuals appears to have re-

sulted from PRKAR1A mutations.
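The cDNA and protein designations reported for the PDE4D and PRKAR1A variants are linked by simple codon arithmetic: coding nucleotide n lies in codon (n - 1) div 3 + 1. The following sketch checks the five missense variants described in the text against that relationship. The helper names `codon_number` and `protein_position` are hypothetical, and the parsing handles only simple substitutions written in the form used here.

```python
import re


def codon_number(cdna_change):
    """Return the 1-based codon containing a substitution such as 'c.682C>G'."""
    match = re.fullmatch(r"c\.(\d+)([ACGT])>([ACGT])", cdna_change)
    if match is None:
        raise ValueError(f"unsupported cDNA notation: {cdna_change!r}")
    position = int(match.group(1))
    return (position - 1) // 3 + 1


def protein_position(protein_change):
    """Return the residue number from a change such as 'p.Gln228Glu'."""
    match = re.fullmatch(r"p\.([A-Za-z]{3})(\d+)([A-Za-z]{3})", protein_change)
    if match is None:
        raise ValueError(f"unsupported protein notation: {protein_change!r}")
    return int(match.group(2))


# cDNA and protein changes as reported in the text for PDE4D and PRKAR1A.
REPORTED = [
    ("c.682C>G", "p.Gln228Glu"),
    ("c.1769A>C", "p.Glu590Ala"),
    ("c.2018G>A", "p.Gly673Asp"),
    ("c.1004G>C", "p.Arg335Pro"),
    ("c.980T>C", "p.Ile327Thr"),
]

consistent = [
    codon_number(cdna) == protein_position(protein)
    for cdna, protein in REPORTED
]
```

A check like this confirms only that the two notations refer to the same codon on a common transcript; it does not verify the amino acid substitution itself, which requires the reference sequence.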

All five missense variants (three in PDE4D and two in

PRKAR1A) were predicted to be damaging by PolyPhen-2

(Polymorphism Phenotyping version 2) and/or SIFT, two

commonly used tools that predict the functional conse-

quences of amino acid changes on the basis of sequence

homology and the physical properties of the amino acids.

None of these variants were observed in an internal exome

dataset of 48 individuals affected by different medical

conditions, in a group of 250 published exome data-

sets,8,9 or among the 5,379 exomes available from the

National Heart, Lung, and Blood Institute (NHLBI) Exome

Sequencing Project Exome Variant Server (ESP5400).

The findings described here thus demonstrate that acro-

dysostosis can result from missense mutations in PDE4D,

the gene encoding cAMP-dependent phosphodiesterase

4D. PDE4D encodes at least five isoforms that differ at their

amino-terminal ends as a result of alternate transcription

start sites or alternative splicing.10 The encoded proteins

range in size from 508 to 810 amino acids, and the three

longer isoforms contain two highly evolutionarily con-

served upstream regions (UCR1 and UCR2) and the large

catalytic domain. The two shorter isoforms lack the

amino-terminal UCR1 domain, which regulates catalytic

activity along with UCR2.11 The p.Gln228Glu substitution

alters a conserved residue in the UCR1 region, indicating

that disruption of the longer isoforms alone is enough to

cause a phenotypic effect in the target tissues and result

in acrodysostosis. The p.Glu590Ala and p.Gly673Asp sub-

stitutions alter conserved catalytic-domain amino acids,

indicating that these residues are essential for normal

PDE4D activity.

The results of this study also confirm that mutations in

PRKAR1A, which encodes the cyclic AMP-dependent regu-

latory subunit of PKA, can also lead to acrodysostosis.6 The

two substitutions, p.Arg335Pro and p.Ile327Thr, found in

PRKAR1A were different than the recurrent mutation

(p.R368*) previously reported,6 but all three mutations

were in exon 11, which encodes part of the highly

conserved cAMP-binding domain B. Binding of cAMP by

PRKAR1A is required for the release and activation of

PKA (Figure 2), which then phosphorylates and activates

CREB; this process then leads to the expression of down-

stream targets. This suggests that these mutations could

cause reduced cAMP binding and result in reduced PKA

activation and, consequently, reduced downstream signal-

ing. This mechanism would distinguish the acrodysostosis

mutations from the PRKAR1A mutations that cause Carney

Figure 2. cAMP Signaling Cascade. Ligand binding (represented in this example by PTH, but other ligands and receptors can stimulate cAMP synthesis) activates Gs-α and stimulates cAMP synthesis by adenylate cyclase. The binding of cAMP by PRKAR1A, the cAMP-dependent regulatory subunit, leads to the dissociation and activation of PKA and the subsequent phosphorylation of cAMP response element binding (CREB), nuclear translocation, and expression of downstream genes. PDE4D phosphodiesterase activity modulates cAMP levels. Mutations (indicated by asterisks) in the genes encoding these three components of the pathway result in a spectrum of clinically related disorders: acrodysostosis for mutations in PDE4D or PRKAR1A or Albright hereditary osteodystrophy for mutations in GNAS, the gene encoding Gs-α.


complex (the mutations that cause Carney complex

primarily lead to reduced PRKAR1A synthesis, lack of regu-

latory control of PKA activation, and derepression of

CREB-mediated targets).12

The clinical and radiographic phenotypes (summarized

in Table 1) facilitated comparing the acrodysostosis cases

with the typical symptoms associated with either PDE4D

or PRKAR1A mutations. Mild short stature with small

hands was present in all of the cases, including those

with PRKAR1A mutations previously described,6 regardless

of the locus involved. Similarly, stenosis of the lumbar

spine and midface hypoplasia with a small nose were

consistent findings both clinically and radiographically

(Figure 1). However, endocrine abnormalities were vari-

able; hypothyroidism was documented in just two of the

individuals, R99-514 (who had a PDE4D mutation) and

R99-101 (who had a PRKAR1A mutation). Hypothyroidism

persisted in individual R99-101 but spontaneously

resolved in individual R99-514 when he reached three

years of age. However, firm conclusions cannot be made

from these observations because the number of cases

studied thus far is too small. One of the two male individuals

with a PDE4D mutation (R95-141A) had cryptorchidism.

One of the PRKAR1A individuals described here,

R99-101, exhibited a unilateral undescended testis, and

both of the males previously described6 had cryptorchi-

dism, indicating that hypogonadism can be found in cases

with defects in either gene. From a neurological viewpoint,

four of the five individuals studied had some degree of

developmental disability, and one individual (R95-141A)

displayed significant behavioral problems. Thus, it is diffi-

cult to distinguish acrodysostosis cases with PDE4D muta-

tions from those with PRKAR1A mutations by clinical

observation only.

The acrodysostosis phenotype is similar to that of Pde4d-

knockout mice.13 As in humans with acrodysostosis,

Pde4d-null mice exhibit reduced growth and midface hypoplasia.

Females with acrodysostosis have been reported to

have irregular menses, and knockout mice have reduced

fertility associated with decreased ovulation and oocyte

degeneration. These observations suggest that the human

mutations lead to reduced PDE4D activity. Because the

heterozygous knockout mice were phenotypically normal

and had essentially normal phosphodiesterase activity,13

it appears that haploinsufficiency for PDE4D activity has

no phenotypic consequence. Because PDE4D is a dimer,

the data suggest the possibility that the missense alleles

identified in the acrodysostosis cases might cause the

phenotype via a dominant-negative effect on the protein.

Albright hereditary osteodystrophy (MIM 103580)

shares phenotypic features, including short stature, bra-

chydactyly, hormone resistance, and varying degrees of

developmental disability, with acrodysostosis and results

from mutations in GNAS14 (MIM 139320), the gene encoding

the adenylate cyclase activating protein Gs-α. Gs-α,

PDE4D, and PRKAR1A are all components of the cAMP

signaling pathway (Figure 2). The disruption of PRKAR1A

and GNAS causes downregulation of the cAMP signaling

cascade in response to an external signal, such as parathy-

roid hormone (PTH). Although decreased PDE4D activity

might be predicted to increase cAMP levels, it has been sug-

gested13 that inactivation of PDE4D-mediated negative

feedback would cause a permanent desensitization state

of the cAMP signaling pathway; this desensitization would

paradoxically lead to a significant reduction in the cAMP

response. Consequently, the phenotypic effects resulting

from PDE4D mutations would be similar to those resulting

from PRKAR1A and GNAS defects.

PDE4D is orthologous to Drosophila dunce, which has

been shown to play a role in learning and memory in

flies.15 Flies deficient in dunce have reduced cAMP

phosphodiesterase activity,16 a reduction which results in

defects in both associative and nonassociative memory.17

Although increased branching of terminal neuronal

processes has been observed in dunce larvae (implicating

abnormal brain morphology as an element of the pheno-

type18), alterations that occur in the biochemical process

of memory as a result of altered cAMP levels in the

mushroom body of the Drosophila brain appear to be the

predominant effect of dunce mutations.19 Because most

acrodysostosis cases exhibit significant developmental

disabilities, the data presented here raise the possibility

that PDE4D deficiency disrupts a highly evolutionarily

conserved neurological pathway.

Thus, a variety of genetic defects that alter cAMP metab-

olism produce disorders with a related constellation of

findings, which include short stature with brachydactyly,

endocrine abnormalities, and developmental disability.

However, the precise role of PDE4D in the skeleton, partic-

ularly in growth-plate cartilage, is not well understood.

Loss of cAMP activity as a result of a chondrocyte-specific

knockout of Gs-α revealed severe growth-plate abnormalities,

accelerated hypertrophic chondrocyte differentiation

with ectopic cartilage formation, and increased parathy-

roid hormone-related peptide expression in periarticular

chondrocytes.20 Individuals with acrodysostosis have

been reported to exhibit accelerated bone maturation

as well as ectopic bone formation,21,22 supporting the

hypothesis that a component of the cartilage phenotype

might be reduced activity of the cAMP signaling cascade.

It remains to be determined whether modulation of cAMP

levels could ameliorate the phenotypic consequences of

mutations in the pathway in any meaningful way, espe-

cially in the primary target tissues of the skeleton, brain,

and endocrine organs. Understanding the complexity of

cAMP regulation among the affected tissues would be an

important step in achieving this goal.

Acknowledgments

We would like to thank Traci Toy and Bret Harry at the University

of California, Los Angeles (UCLA) DNA Microarray Core for their

assistance with constructing the sequencing libraries and compu-

tational support, Suhua Feng at the UCLA Broad Stem Cell


Research Center for his assistance in running the HiSeq2000

instrument, Lisette Nevarez for assistance with Sanger-sequence

analysis, and both Nancy Kramer and Daniel Gruskin for assis-

tance with the clinical information. This study was supported in

part by the Steven Spielberg Pediatric Research Center at Cedars-

Sinai Medical Center and by National Institutes of Health grant

HD22657.

Received: December 2, 2011

Revised: March 1, 2012

Accepted: March 6, 2012

Published online: March 29, 2012

Web Resources

The URLs for data presented herein are as follows:

Exome Variant Server, http://evs.gs.washington.edu/EVS/

Genome Analysis Toolkit, ftp://gsapubftp-anonymous@ftp.broadinstitute.org

Novocraft Short Read Alignment Package, http://www.novocraft.com

Online Mendelian Inheritance in Man (OMIM), http://www.omim.org

Picard, http://picard.sourceforge.net/

PolyPhen-2, http://genetics.bwh.harvard.edu/pph2/bgi.shtml

SAMtools, http://samtools.sourceforge.net/

SeattleSeqAnnotation, http://snp.gs.washington.edu/SeattleSeqAnnotation131/

SIFT, http://sift.jcvi.org/

References

1. Maroteaux, P., and Malamut, G. (1968). Acrodysostosis. Presse

Med. 76, 2189–2192.

2. Robinow, M., Pfeiffer, R.A., Gorlin, R.J., McKusick, V.A.,

Renuart, A.W., Johnson, G.F., and Summitt, R.L. (1971). Acrodysostosis.

A syndrome of peripheral dysostosis, nasal hypoplasia,

and mental retardation. Am. J. Dis. Child. 121, 195–203.

3. Jones, K.L., Smith, D.W., Harvey, M.A.S., Hall, B.D., and Quan,

L. (1975). Older paternal age and fresh gene mutation: Data on

additional disorders. J. Pediatr. 86, 84–88.

4. Steiner, R.D., and Pagon, R.A. (1992). Autosomal dominant

transmission of acrodysostosis. Clin. Dysmorphol. 1, 201–206.

5. Sheela, S.R., Perti, A., and Thomas, G. (2005). Acrodysostosis:

Autosomal dominant transmission. Indian Pediatr. 42,

822–826.

6. Linglart, A., Menguy, C., Couvineau, A., Auzan, C., Gunes, Y.,

Cancel, M., Motte, E., Pinto, G., Chanson, P., Bougneres, P.,

et al. (2011). Recurrent PRKAR1A mutation in acrodysostosis

with hormone resistance. N. Engl. J. Med. 364, 2218–2226.

7. Graham, J.M., Jr., Krakow, D., Tolo, V.T., Smith, A.K., and

Lachman, R.S. (2001). Radiographic findings and Gs-alpha

bioactivity studies and mutation screening in acrodysostosis

indicate a different etiology from pseudohypoparathyroidism.

Pediatr. Radiol. 31, 2–9.

8. Yi, X., Liang, Y., Huerta-Sanchez, E., Jin, X., Cuo, Z.X., Pool,

J.E., Xu, X., Jiang, H., Vinckenbosch, N., Korneliussen, T.S.,

et al. (2010). Sequencing of 50 human exomes reveals adapta-

tion to high altitude. Science 329, 75–78.

9. Li, Y., Vinckenbosch, N., Tian, G., Huerta-Sanchez, E., Jiang,

T., Jiang, H., Albrechtsen, A., Andersen, G., Cao, H., Kornelius-

sen, T., et al. (2010). Resequencing of 200 human exomes

identifies an excess of low-frequency non-synonymous

coding variants. Nat. Genet. 42, 969–972.

10. Bolger, G.B., Erdogan, S., Jones, R.E., Loughney, K., Scotland,

G., Hoffmann, R., Wilkinson, I., Farrell, C., and Houslay,

M.D. (1997). Characterization of five different proteins

produced by alternatively spliced mRNAs from the human

cAMP-specific phosphodiesterase PDE4D gene. Biochem. J.

328, 539–548.

11. Houslay, M.D., and Adams, D.R. (2003). PDE4 cAMP phospho-

diesterases: Modular enzymes that orchestrate signalling

cross-talk, desensitization and compartmentalization. Bio-

chem. J. 370, 1–18.

12. Bertherat, J., Horvath, A., Groussin, L., Grabar, S., Boikos, S.,

Cazabat, L., Libe, R., Rene-Corail, F., Stergiopoulos, S., Bour-

deau, I., et al. (2009). Mutations in regulatory subunit type

1A of cyclic adenosine 5′-monophosphate-dependent protein

kinase (PRKAR1A): Phenotype analysis in 353 patients and 80

different genotypes. J. Clin. Endocrinol. Metab. 94, 2085–

2091.

13. Jin, S.-L.C., Richard, F.J., Kuo, W.-P., D’Ercole, A.J., and Conti,

M. (1999). Impaired growth and fertility of cAMP-specific

phosphodiesterase PDE4D-deficient mice. Proc. Natl. Acad.

Sci. USA 96, 11998–12003.

14. Patten, J.L., Johns, D.R., Valle, D., Eil, C., Gruppuso, P.A.,

Steele, G., Smallwood, P.M., and Levine, M.A. (1990). Muta-

tion in the gene encoding the stimulatory G protein of adeny-

late cyclase in Albright’s hereditary osteodystrophy. N. Engl.

J. Med. 322, 1412–1419.

15. Dudai, Y., Jan, Y.N., Byers, D., Quinn, W.G., and Benzer, S.

(1976). dunce, a mutant of Drosophila deficient in learning.

Proc. Natl. Acad. Sci. USA 73, 1684–1688.

16. Byers, D., Davis, R.L., and Kiger, J.A., Jr. (1981). Defect in cyclic

AMP phosphodiesterase due to the dunce mutation of

learning in Drosophila melanogaster. Nature 289, 79–81.

17. Gong, Z., Xia, S., Liu, L., Feng, C., and Guo, A. (1998). Operant

visual learning and memory in Drosophila mutants dunce,

amnesiac and radish. J. Insect Physiol. 44, 1149–1158.

18. Zhong, Y., Budnik, V., and Wu, C.F. (1992). Synaptic plasticity

in Drosophila memory and hyperexcitable mutants: Role of

cAMP cascade. J. Neurosci. 12, 644–651.

19. Davis, R.L. (1996). Physiology and biochemistry of Drosophila

learning mutants. Physiol. Rev. 76, 299–317.

20. Sakamoto, A., Chen, M., Kobayashi, T., Kronenberg, H.M., and

Weinstein, L.S. (2005). Chondrocyte-specific knockout of the

G protein G(s)alpha leads to epiphyseal and growth plate

abnormalities and ectopic chondrocyte formation. J. Bone

Miner. Res. 20, 663–671.

21. Butler, M.G., Rames, L.J., and Wadlington, W.B. (1988). Acro-

dysostosis: Report of a 13-year-old boy with review of litera-

ture and metacarpophalangeal pattern profile analysis. Am.

J. Med. Genet. 30, 971–980.

22. Becker, S., Mausolf, A., and Laszig, R. (1989). Acrodysostosis:

an autosomal inherited form of peripheral dysostosis. HNO

37, 165–168.


CORRECTION

This Month in Genetics

Kathryn B. Garber*

(The American Journal of Human Genetics 90, 383–384; March 9, 2012)

In the summary titled New Tools for Interpretation of Newborn-Screening Results, the first author of the paper being discussed

should have been “Marquardt,” not “Marquard.”

Marquardt et al. (2012) Genet Med. Published online February 16, 2012. 10.138/gim.2012.2.

*Correspondence: [email protected]

DOI 10.1016/j.ajhg.2012.03.010. ©2012 by The American Society of Human Genetics. All rights reserved.

752 The American Journal of Human Genetics 90, 752, April 6, 2012

ERRATUM

Large-Scale Gene-Centric Meta-Analysis across 39 Studies Identifies Type 2 Diabetes Loci

Richa Saxena,* Clara C. Elbers, Yiran Guo, Inga Peter, Tom R. Gaunt, Jessica L. Mega, Matthew B. Lanktree, Archana Tare, Berta Almoguera Castillo, Yun R. Li, Toby Johnson, Marcel Bruinenberg, Diane Gilbert-Diamond, Ramakrishnan Rajagopalan, Benjamin F. Voight, Ashok Balasubramanyam, John Barnard, Florianne Bauer, Jens Baumert, Tushar Bhangale, Bernhard O. Bohm, Peter S. Braund, Paul R. Burton, Hareesh R. Chandrupatla, Robert Clarke, Rhonda M. Cooper-DeHoff, Errol D. Crook, George Davey-Smith, Ian N. Day, Anthonius de Boer, Mark C.H. de Groot, Fotios Drenos, Jane Ferguson, Caroline S. Fox, Clement E. Furlong, Quince Gibson, Christian Gieger, Lisa A. Gilhuijs-Pederson, Joseph T. Glessner, Anuj Goel, Yan Gong, Struan F.A. Grant, Diederick E. Grobbee, Claire Hastie, Steve E. Humphries, Cecilia E. Kim, Mika Kivimaki, Marcus Kleber, Christa Meisinger, Meena Kumari, Taimour Y. Langaee, Debbie A. Lawlor, Mingyao Li, Maximilian T. Lobmeyer, Anke-Hilse Maitland-van der Zee, Matthijs F.L. Meijs, Cliona M. Molony, David A. Morrow, Gurunathan Murugesan, Solomon K. Musani, Christopher P. Nelson, Stephen J. Newhouse, Jeffery R. O’Connell, Sandosh Padmanabhan, Jutta Palmen, Sanjey R. Patel, Carl J. Pepine, Mary Pettinger, Thomas S. Price, Suzanne Rafelt, Jane Ranchalis, Asif Rasheed, Elisabeth Rosenthal, Ingo Ruczinski, Sonia Shah, Haiqing Shen, Gunther Silbernagel, Erin N. Smith, Annemieke W.M. Spijkerman, Alice Stanton, Michael W. Steffes, Barbara Thorand, Mieke Trip, Pim van der Harst, Daphne L. van der A, Erik P.A. van Iperen, Jessica van Setten, Jana V. van Vliet-Ostaptchouk, Niek Verweij, Bruce H.R. Wolffenbuttel, Taylor Young, M. Hadi Zafarmand, Joseph M. Zmuda, the Look AHEAD Research Group, DIAGRAM consortium, Michael Boehnke, David Altshuler, Mark McCarthy, W.H. Linda Kao, James S. Pankow, Thomas P. Cappola, Peter Sever, Neil Poulter, Mark Caulfield, Anna Dominiczak, Denis C. Shields, Deepak L. Bhatt, Li Zhang, Sean P. Curtis, John Danesh, Juan P. Casas, Yvonne T. van der Schouw, N. Charlotte Onland-Moret, Pieter A. Doevendans, Gerald W. Dorn II, Martin Farrall, Garret A. FitzGerald, Anders Hamsten, Robert Hegele, Aroon D. Hingorani, Marten H. Hofker, Gordon S. Huggins, Thomas Illig, Gail P. Jarvik, Julie A. Johnson, Olaf H. Klungel, William C. Knowler, Wolfgang Koenig, Winfried Marz, James B. Meigs, Olle Melander, Patricia B. Munroe, Braxton D. Mitchell, Susan J. Bielinski, Daniel J. Rader, Muredach P. Reilly, Stephen S. Rich, Jerome I. Rotter, Danish Saleheen, Nilesh J. Samani, Eric E. Schadt, Alan R. Shuldiner, Roy Silverstein, Kandice Kottke-Marchant, Philippa J. Talmud, Hugh Watkins, Folkert W. Asselbergs, Paul I.W. de Bakker, Jeanne McCaffery, Cisca Wijmenga, Marc S. Sabatine, James G. Wilson, Alex Reiner, Donald W. Bowden, Hakon Hakonarson, David S. Siscovick, and Brendan J. Keating*

The American Journal of Human Genetics, 90, 410–425; March 2012

The originally published online version of this paper omitted two authors, Peter Sever and Neil Poulter, who have now

been added. Middle initials have also been added for Deepak L. Bhatt and Folkert W. Asselbergs. In addition, the ASCOT

and INVEST portions of the Supplemental Acknowledgments have been updated. The authors regret the errors.

*Correspondence: [email protected] (R.S.), [email protected] (B.J.K.)

DOI 10.1016/j.ajhg.2012.03.001. ©2012 by The American Society of Human Genetics. All rights reserved.

The American Journal of Human Genetics 90, 753, April 6, 2012 753