

Structural variants caused by Alu insertions are associated with risks for many human diseases

Lindsay M. Payer (a,1), Jared P. Steranka (a), Wan Rou Yang (a), Maria Kryatova (a), Sibyl Medabalimi (a), Daniel Ardeljan (a,b), Chunhong Liu (a), Jef D. Boeke (b,c,d,e,f,1), Dimitri Avramopoulos (b,g), and Kathleen H. Burns (a,b,d,f,1)

(a) Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205; (b) McKusick–Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205; (c) Institute for Systems Genetics, New York University Langone School of Medicine, New York, NY 10016; (d) Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205; (e) Department of Molecular Biology & Genetics, Johns Hopkins University School of Medicine, Baltimore, MD 21205; (f) High Throughput Biology Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205; and (g) Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21205

Contributed by Jef D. Boeke, March 22, 2017 (sent for review November 4, 2016; reviewed by Mark A. Batzer and Jeffrey Han)

Interspersed repeat sequences comprise much of our DNA, although their functional effects are poorly understood. The most commonly occurring repeat is the Alu short interspersed element. New Alu insertions occur in human populations, and have been responsible for several instances of genetic disease. In this study, we sought to determine if there are instances of polymorphic Alu insertion variants that function in a common variant, common disease paradigm. We cataloged 809 polymorphic Alu elements mapping to 1,159 loci implicated in disease risk by genome-wide association study (GWAS) (P < 10⁻⁸). We found that Alu insertion variants occur disproportionately at GWAS loci (P = 0.013). Moreover, we identified 44 of these Alu elements in linkage disequilibrium (r² > 0.7) with the trait-associated SNP. This figure represents a >20-fold increase in the number of polymorphic Alu elements associated with human phenotypes. This work provides a broader perspective on how structural variants in repetitive DNAs may contribute to human disease.

Alu | structural variant | GWAS | causative variant | interspersed repeats

We understand the function of only a small fraction of our genome. Protein-coding gene exons account for about 1% of our DNA (1). Although trans-acting factors and modified histone positions have been identified for other regions of DNA (e.g., ref. 2), researchers have identified the function of few noncoding sequences.

The most frequently occurring sequences are perhaps the least well understood. Interspersed repeats derived from mobile DNAs comprise ∼45% of our genome (1, 3). These sequences are often presumed to be nonfunctional “junk DNA,” with rare exceptions causing genetic disease. The first known mobile element insertion causing disease was a long interspersed element-1 (LINE-1) insertion interrupting a coding exon of the coagulation factor VIII gene (FVIII) to cause hemophilia A (4). Insertions of other transposons have also been reported in FVIII in patients with hemophilia A (e.g., refs. 5–7). In all, 124 cases of genetic disease have been attributed to de novo interspersed repeat insertions, including 76 cases caused by Alu short interspersed elements (SINEs) and 30 cases attributed to LINE-1 insertions (8). These insertions compromise gene function by disrupting an exon or interfering with mRNA splicing. In each instance, the mobile element insertion severely affects allele function and leads to an overt phenotype.

It has remained an open question as to whether common interspersed repeat variants also affect human health. Large numbers of interspersed repeat insertion variants that would be candidates for such an effect have been identified in recent years (e.g., refs. 9–11). Polymorphic interspersed repeats include LINE-1 sequences that copy and insert into new genomic loci using LINE-1–encoded reverse transcriptase (12, 13). LINE-1 also mediates the retrotransposition of other interspersed repeats, including Alu SINE (14) and SINE-variable number tandem repeat (VNTR)-Alu elements (15, 16). These interspersed repeat variants represent a major source of genetic diversity in humans. Over 18,000 common polymorphic transposable elements have been mapped (e.g., ref. 9), and it is estimated that >60,000 exist (17, 18).

To test the hypothesis that a subset of common transposable element polymorphisms has implications for human health, we focused on genomic intervals related to disease risk by genome-wide association study (GWAS). GWAS uses large numbers of cases and controls to approximate the locations of genetic variants that predispose to disease. We found unexpected numbers of Alu insertion polymorphisms residing within these GWAS intervals. At 44 loci, we demonstrate linkage disequilibrium (LD) between the Alu variant and trait-associated SNPs (TASs) identified by GWAS, thus providing genetic evidence that the Alu insertion is one of the candidate causative variants at each locus.

Results

Polymorphic Alu Elements Are Common Structural Variants. Alu insertions are ∼300-bp bipartite, primate-specific interspersed repeats derived from 7SL RNA (19). There are over 1.1 million Alu copies in the human genome (1, 3). Of these elements, a small subset is polymorphic in the population such that both insertion and preinsertion (empty) alleles are present (e.g., refs. 9, 20–22). These elements are of special interest because they represent insertions that occurred relatively recently.

Significance

Repetitive sequences comprise a large portion of the genome and are often thought of as “junk DNA.” They are a significant source of genetic variation, particularly Alu elements. Their functional consequence is frequently dismissed. Here, we test the hypothesis that Alu polymorphisms contribute to phenotypic differences between individuals. We identified an enrichment of Alu polymorphisms in regions of the genome associated with human disease risk. Further, we find 44 instances where the trait-associated SNP is a surrogate for presence or absence of an Alu insertion. This finding indicates that the Alu may be the variant affecting disease risk, an intriguing possibility given its size and regulatory potential. This work emphasizes the importance of considering repeat polymorphisms in functional variant analysis.

Author contributions: L.M.P., J.D.B., and K.H.B. designed research; L.M.P., J.P.S., W.R.Y., M.K., and S.M. performed research; D. Avramopoulos contributed new reagents/analytic tools; L.M.P., J.P.S., W.R.Y., M.K., D. Ardeljan, C.L., D. Avramopoulos, and K.H.B. analyzed data; and L.M.P., D. Ardeljan, C.L., J.D.B., and K.H.B. wrote the paper.

Reviewers: M.A.B., Louisiana State University; and J.H., Tulane University School of Medicine.

J.H. was a PhD trainee of J.D.B. (graduation date 2005). J.H. was a middle author on a single paper from J.D.B. in the past 4 years, on an unrelated project; its publication was delayed until 2014. All other authors declare no conflict of interest.

1 To whom correspondence may be addressed. Email: [email protected], lhorvat1@jhmi.edu, or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1704117114/-/DCSupplemental.

E3984–E3992 | PNAS | Published online May 2, 2017 www.pnas.org/cgi/doi/10.1073/pnas.1704117114


To identify polymorphic retrotransposon copies that could affect disease risk, we focused on those copies mapping to locations already associated with disease phenotypes by GWAS. As a pilot study, we set out to catalog polymorphic Alu elements near GWAS signals. We mapped retrotransposon insertion variants by transposon insertion profiling by microarray (TIP-chip), in which ligation-mediated PCR amplifies insertion sites and the products are analyzed by microarray (23). By using PCR conditions specific for the AluYa5/8 and AluYb8/9 subfamilies, we avoided the evolutionarily older Alu subfamilies that are largely invariant in the genome (i.e., homozygously present in all people). These PCR products were fluorescently labeled and hybridized to custom genomic tiling microarrays encompassing 160-kb regions around 825 TASs (Dataset S1). We performed TIP-chip on genomic DNA samples from nine individuals from the Centre d’Étude du Polymorphisme Humain collection of Utah residents with ancestry from Northern and Western Europe (CEU) and compared their retrotransposon complements. This analysis revealed 80 polymorphic Alu elements mapping near GWAS-identified TASs; of these elements, 16 mapped within the LD block defined by the TAS and proxy SNPs with a correlation coefficient (r²) > 0.8 (Dataset S2). Thus, common Alu insertion polymorphisms could be readily identified in positions of the genome with importance in disease.

In parallel with these studies, more Alu structural variants were defined by the 1000 Genomes Project and others, obviating the need for us to profile common variants in small numbers of samples (e.g., refs. 9, 11). Due to the pace of these discovery efforts, no comprehensive list of all reported polymorphisms has been developed for Alu variants. Therefore, we collated reports of previously identified polymorphic Alu elements (10, 11, 21, 22, 24–27) to generate a catalog of 13,572 Alu variants distributed throughout the genome. These reports underscore that Alu polymorphisms are a major source of genetic diversity in humans.
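The text does not say how overlapping reports were reconciled when the catalog was collated. As a minimal sketch of one plausible approach, the Python below merges insertion calls from several reports and treats calls on the same chromosome that lie within a small window of each other as the same insertion. The function names, the record layout, and the 100-bp `tolerance_bp` window are all assumptions made for illustration, not details from the study.

```python
from collections import defaultdict


def _summarize(chrom, cluster):
    """Collapse one cluster of (position, source) calls into a catalog entry."""
    positions = [pos for pos, _ in cluster]
    sources = sorted({source for _, source in cluster})
    return {"chrom": chrom, "start": min(positions),
            "end": max(positions), "sources": sources}


def collate_alu_reports(reports, tolerance_bp=100):
    """Merge insertion calls from several reports into one catalog.

    reports: dict mapping a report name to a list of (chrom, position) calls.
    tolerance_bp: hypothetical window; a call within this distance of the
    previous call on the same chromosome joins that call's cluster.
    """
    by_chrom = defaultdict(list)
    for source, calls in reports.items():
        for chrom, pos in calls:
            by_chrom[chrom].append((pos, source))

    catalog = []
    for chrom in sorted(by_chrom):
        cluster = []
        for pos, source in sorted(by_chrom[chrom]):
            # Start a new cluster when the gap to the previous call is too large.
            if cluster and pos - cluster[-1][0] > tolerance_bp:
                catalog.append(_summarize(chrom, cluster))
                cluster = []
            cluster.append((pos, source))
        if cluster:
            catalog.append(_summarize(chrom, cluster))
    return catalog


# Illustrative input only; these coordinates are invented.
example_reports = {
    "report_A": [("chr1", 1000), ("chr2", 500)],
    "report_B": [("chr1", 1040), ("chr1", 5000)],
}
example_catalog = collate_alu_reports(example_reports)
```

Keeping the contributing report names on each entry makes it easy to see which insertions were found independently by more than one study.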

Polymorphic Alu Elements Are Enriched at GWAS Loci. To identify those Alu variants potentially involved in disease, we determined intervals of the genome where variants could be tagged by a TAS; each TAS marks a region in LD that contains other variants for which the TAS could serve as a proxy. We determined the LD block around each TAS based on the physical positions of SNPs with pairwise correlation coefficients (r²) with the TAS ≥ 0.8. We identified polymorphic Alu elements contained within LD blocks defined by all GWAS TASs with P < 10⁻⁹. We excluded all GWAS signals and Alu elements at the HLA locus, given the extensive haplotypes in this region. We found 625 polymorphic Alu elements that map to 899 GWAS signals (Fig. 1); ∼17% of GWAS signals included in our analysis have a polymorphic Alu element within the identified LD block.

We next examined the distribution of these 625 polymorphic Alu elements. Relative to all polymorphic Alu elements, those elements mapping to GWAS signals are enriched within genes [odds ratio (OR) = 1.93, 95% confidence interval (CI) = 1.67–2.24; P = 5.91 × 10⁻¹⁹] and within 10 kb of genes (OR = 2.54, 95% CI = 1.98–2.70; P = 1.82 × 10⁻¹⁰). This finding is expected, because TASs are also enriched in genes (OR = 1.79, 95% CI = 1.65–1.93; P < 1 × 10⁻⁴⁷) and within 10 kb of genes (OR = 4.44, 95% CI = 3.93–5.01; P = 1.32 × 10⁻¹⁰⁶).

Fig. 1. Polymorphic Alu elements map to GWAS loci. GWAS signals (P ≤ 10⁻⁹) with at least one polymorphic Alu element mapping to the TAS-defined LD block are represented by a circle, with color based on the phenotype reported in the GWAS. Figure modified from ref. 75. Elements mapping to sex chromosomes and the HLA locus (*) were excluded from the diagram. Details of GWAS loci and polymorphic Alu elements are provided in Dataset S3. [Ideogram of chromosomes 1–22. Color legend: digestive system disease; cardiovascular disease; metabolic disease; immune system disease; nervous system disease; liver enzyme measurement; lipid or lipoprotein measurement; inflammatory measurement; hematological measurement; body weights and measures; cardiovascular measurement; other measurement; response to drug; biological process; cancer; other disease; other trait.]

If polymorphic Alu insertions often contribute to altered disease risk, we might expect an enrichment of these elements in the TAS LD blocks. To investigate this possibility, we compared the number of polymorphic Alu elements contained within TAS LD blocks with the number expected by chance, estimated from 1,000 sets of random LD blocks with characteristics that mirror the TAS LD blocks. All reported GWAS signals (P ≤ 10⁻⁹) were reduced to a list of 3,242 nonoverlapping TAS LD blocks, with 625 polymorphic Alu elements occurring in these regions (Figs. 1 and 2A). To determine the rate of occurrence expected by chance, 1,000 random sets of 3,242 LD blocks each were generated. To control for nonrandom distribution of GWAS signals in the genome, the random LD blocks were based on random SNPs selected to mirror TASs. Because TASs are common variants enriched near genes, we considered both the minor allele frequency (MAF) and distance from the nearest gene when selecting the random SNPs. Each random SNP selected matched a TAS in MAF (within bins of 5%) and distance to the nearest gene (within gene, <10 kb, or >10 kb). These random SNPs were used to define random LD blocks (SI Materials and Methods) that we confirmed are of similar size to TAS LD blocks (Fig. 2B). Thus, although Alu elements modestly influence recombination rates and local LD structure (28), we did not need to correct for this effect by imposing selection on LD block size. Each set of 3,242 blocks was considered to be a single iteration, and the number of times polymorphic Alu elements mapped to these random LD blocks in the iteration was recorded. We needed to perform 1,000 iterations to see several instances of >625 Alu insertion variants corresponding to these intervals by chance (Fig. 2A). Thus, the observed number of Alu variants in proximity to GWAS signals is significantly greater than expected (P = 0.013; Fig. 2A). This finding supports the hypothesis that these elements may have functional impacts detected at multiple GWAS loci.
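The enrichment test above has three parts: counting Alu variants that fall in a set of LD blocks, drawing random SNPs matched to each TAS on MAF bin and gene-distance category, and comparing the observed count with the counts from the random sets. The Python below is a minimal sketch of those parts, not the authors' pipeline. All function names and record fields are hypothetical, building LD blocks around the sampled SNPs (which needs genotype data) is left out, and reading the reported P value as the fraction of iterations exceeding the observed count is an assumption.

```python
import bisect
import random


def count_alus_in_blocks(alu_positions, blocks):
    """Count Alu insertion sites that fall inside any of the given blocks.

    alu_positions: dict chrom -> sorted list of insertion coordinates.
    blocks: list of (chrom, start, end), assumed nonoverlapping, inclusive.
    """
    total = 0
    for chrom, start, end in blocks:
        positions = alu_positions.get(chrom, [])
        total += bisect.bisect_right(positions, end) - bisect.bisect_left(positions, start)
    return total


def match_key(maf, gene_distance_bp):
    """Matching category: 5% MAF bin plus distance-to-gene class."""
    maf_bin = int(round(maf * 1000)) // 50  # integer math avoids float bin-edge errors
    if gene_distance_bp == 0:
        distance_class = "within gene"
    elif gene_distance_bp < 10_000:
        distance_class = "<10 kb"
    else:
        distance_class = ">10 kb"
    return maf_bin, distance_class


def sample_matched_snps(tas_list, snp_pool, rng):
    """Pick one random SNP per TAS from the pool entries sharing its match key.

    Raises KeyError if the pool has no SNP in a TAS's category.
    """
    pool_by_key = {}
    for snp in snp_pool:
        key = match_key(snp["maf"], snp["gene_distance_bp"])
        pool_by_key.setdefault(key, []).append(snp)
    return [rng.choice(pool_by_key[match_key(t["maf"], t["gene_distance_bp"])])
            for t in tas_list]


def empirical_p(observed, null_counts):
    """Fraction of random iterations whose count exceeds the observed count."""
    return sum(count > observed for count in null_counts) / len(null_counts)


# Invented toy data to show the calls.
toy_alus = {"chr1": [100, 500, 900], "chr2": [50]}
toy_blocks = [("chr1", 400, 1000), ("chr2", 0, 40), ("chr3", 0, 10)]
toy_overlap = count_alus_in_blocks(toy_alus, toy_blocks)
toy_p = empirical_p(625, [600, 630, 610, 626])
```

Under this reading, 13 of 1,000 random sets exceeding 625 overlaps would give the reported P = 0.013. A common alternative counts ties as well and adds one to the numerator and denominator.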

Alu Insertions Are Common Variants at Many GWAS Loci. GWAS depends on a causative variant that is a frequent allele (i.e., a common variant). Because allele frequencies for many Alu insertion polymorphisms in our catalog were unknown, we developed PCR assays to genotype them in reference DNA samples. We designed PCR primers to amplify across the reported insertion sites of 576 Alu variants near a TAS. To identify common variants, we first screened pools of genomic DNA samples from a panel of unrelated CEU individuals. We electrophoresed PCR products on agarose gels to identify loci producing two fragments indicating the filled (Alu-containing) and empty (preinsertion) alleles. Of the 576 reported polymorphic Alu elements, we were able to detect 174 in the pooled DNA samples.

To determine the allele frequencies of the Alu insertions at these 174 loci, we genotyped individual samples from the CEU HapMap reference panel (Dataset S4). The detected Alu insertion allele frequency ranged from 0.005 to 0.994, with an average of 0.351 (Fig. S1).
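For a diploid sample, the insertion allele frequency follows directly from the per-individual PCR genotypes: each filled/filled individual contributes two insertion alleles, each filled/empty individual one, and each empty/empty individual none. The Python below is a small sketch of that arithmetic. The genotype labels and the handling of failed assays are assumptions for illustration.

```python
def insertion_allele_frequency(genotypes):
    """Estimate the Alu insertion allele frequency from diploid genotype calls.

    genotypes: iterable of labels; anything other than the three recognized
    labels (for example a failed PCR) is skipped. Labels are hypothetical.
    """
    insertion_alleles = {"filled/filled": 2, "filled/empty": 1, "empty/empty": 0}
    called = [g for g in genotypes if g in insertion_alleles]
    if not called:
        raise ValueError("no usable genotype calls")
    # Two alleles per called individual.
    return sum(insertion_alleles[g] for g in called) / (2 * len(called))


# Invented calls for five samples, one of which failed.
toy_calls = ["filled/filled", "filled/empty", "empty/empty", "filled/empty", "no call"]
toy_frequency = insertion_allele_frequency(toy_calls)
```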

Some Alu Insertions Are in LD with TASs. We next evaluated the genetic relationship between each polymorphic Alu element and its corresponding TAS(s). We expected Alu insertions responsible for human phenotypes to exist in LD with the TAS, with the two genotypes highly correlated (i.e., high correlation coefficient).

To measure LD between the Alu and TAS, we used SNP genotypes for the same panel of 90 CEU samples from the HapMap project (29). We found little to no correlation between the two variants (r² < 0.4) at 115 of the polymorphic Alu loci. For many, the correlation coefficient approaches zero, indicating that the TAS and polymorphic Alu element alleles are essentially independently segregating (Fig. 3A). The GWAS signal at these loci is apparently unrelated to the Alu insertion variant.

At the remaining 58 polymorphic Alu elements, some LD was observed, with r² > 0.4. These elements included 18 Alu variants that are in moderate LD, as defined by r² values between 0.4 and 0.7, with at least one TAS. Correlation coefficients in this range indicate that the SNP and Alu are not perfect proxies for each other. However, high normalized coefficient of linkage disequilibrium (D′) values (>0.8) at 16 of these loci indicate that their correlations are weakened primarily by differences in allele frequency. In other words, the less frequent variant (Alu or TAS) tends to be on the same strand as one version of the more common variant (Fig. 3B). In some cases (n = 6), we could identify a better proxy SNP than the TAS for the Alu that was directly genotyped in the GWAS. These SNPs did not show phenotype association as strong as the TAS, and thus we infer that the Alu is not a strong candidate for the causative variant. An example is age-related macular degeneration risk, which has been mapped to 1q31.3 by the TAS rs1831282 (P = 9 × 10⁻²⁴) (30) (Fig. S2). Differences in the MAF of the TAS and Alu reduce the ability of the TAS to serve as a proxy for effects of the polymorphic Alu element; the TAS MAF is 0.42, and the Alu MAF is 0.23. However, the variants are in some LD (r² = 0.4, D′ = 0.81). When present, the Alu is most often on the same strand as the C allele of rs1831282 (Fig. S2). Using individual-level genotyping data for these GWAS patients, we imputed, or inferred, the Alu genotype in each person and repeated the association analysis for this locus. The Alu variant at 1q31.3 is not as highly associated with age-related macular degeneration as rs1831282 (Fig. S2) and, instead, behaves similarly to its proxy SNP. In other cases, a single GWAS genotyping platform was not clearly identified (n = 10) or did not incorporate a strong proxy SNP for the Alu (n = 2); therefore, these cases may merit further follow-up. An example occurs at 4p16.1, where GWAS identified a haplotype associated with urate levels (P = 1 × 10⁻⁹) (31) (Fig. 3B). The TAS has a MAF of 0.34, whereas the polymorphic Alu element has a MAF of 0.49. Higher urate levels associate with the major haplotype (allele frequency = 0.66). When the Alu is present, it is consistently part of the major haplotype, but the preinsertion allele is found with both SNP haplotypes (Fig. 3B).
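The distinction drawn above between r² and D′ can be made concrete with the standard two-locus definitions. The Python below is a minimal sketch that computes both statistics from one haplotype frequency and the two allele frequencies. The function name and arguments are illustrative, and the worked input is an idealized version of the urate example (insertion frequency 0.49, major SNP allele frequency 0.66, insertion assumed to occur only on the major haplotype), not the genotype data used in the study.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (r_squared, d_prime) for two biallelic variants.

    p_ab: frequency of the haplotype carrying allele A (e.g., the Alu
          insertion) together with allele B (e.g., one SNP allele).
    p_a, p_b: frequencies of alleles A and B; both must lie strictly
          between 0 and 1.
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return r_squared, d_prime


# Idealized input: every insertion-bearing chromosome carries the major SNP allele.
toy_r2, toy_d_prime = ld_stats(p_ab=0.49, p_a=0.49, p_b=0.66)
```

With these inputs D′ is 1 because the insertion never appears on the minor haplotype, while r² is only about 0.49 because the two variants differ in allele frequency. This is the pattern the text describes for loci in moderate LD.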

Identification of a Known Alu Variant Associated with Angiotensin-Converting Enzyme Levels. There is one polymorphic Alu element with a well-established functional association. It maps to the angiotensin-converting enzyme gene (ACE) locus (32, 33). The 287-bp Alu variant is located between exons 15 and 16 in an antisense orientation with respect to ACE (Fig. 3C). The Alu genotype is strongly associated with serum-immunoreactive ACE concentrations (32, 33) and was detected in a GWAS of ACE activity (34) by the proxy TAS rs4343, which is in complete LD with the Alu (r² = 1) (Fig. 3C).

Fig. 2. Alu variants are enriched at GWAS signals. (A) Alu variants map to TAS LD blocks 625 times (red). To determine if this number is attributable to chance, 1,000 iterations of random LD blocks mirroring TAS LD blocks were generated. The distribution of the number of times Alu variants overlap with these random blocks is shown (black); P = 0.013. Axes: number of overlaps vs. density. (B) Random LD blocks (gray) closely mirror TAS LD blocks (red) in size. Axes: distance (bp) vs. density.

Alu Elements Are Tagged by TASs. In total, we identified 44 poly-morphic Alu elements at 77 GWAS loci highly correlated withthe TAS (r2 > 0.7; Fig. 4). In these cases, the TAS serves as aperfect, or near-perfect, proxy for the polymorphic Alu element;thus, the Alu insertion could be considered a candidate causativevariant. We note that these insertions and associated SNPs areall common variants, and their impact on disease risk is small.These 44 loci have been associated with a diverse group ofphenotypes through some of the most highly significant GWASsconducted to date (Table 1). Our results indicate that poly-morphic Alu elements are candidate causative variants for a widerange of conditions with significant impacts on human health,including multiple sclerosis (1p13.1: CD58), obesity (2p25), acutelymphoblastic leukemia (ALL) (10q21.2: ARID5B), psoriasis(12q13.3: STAT2), and four breast cancer risk loci (8q23.21:

CASC8, 12p11.22: PTHLH, 2q35: TCF4, and 6p23: RANBP9),among others (Table 1). We found the insertion allele associatedwith both the protective bias and risk without significant bias(P = 0.275; Dataset S3).Of the 44 Alu polymorphisms in strong LD with TASs, 23 map

to intronic regions, with half in each orientation relative to thegene (sense vs. antisense). The remaining 21 insertions in LDwith TASs are intergenic. Of these insertions, 16 Alu elementsare upstream of the nearest gene, whereas five are downstreamof the nearest gene (location bias binomial test, P = 0.027). Thisbias is likely influenced by the enrichment of TASs upstream ofgenes (35) and is also evident when we consider all intergenicpolymorphic Alu elements mapping to TAS LD blocks (OR =1.28, 95% CI = 1.03–1.59; P = 0.02). The range of distancesbetween upstream Alu variants in LD with TAS and the nearestprotein coding gene is broad (1–683 kb) without clustering in theproximal promoter sequence.Alu variants that are in strong LD with TASs mirror the

characteristics of all polymorphic Alu elements in the genome.These Alu elements are all full-length, ∼300 bp (Dataset S5). Wecompared the sequence content of Alu elements by consideringtheir subfamily assignments. The same subfamilies are repre-sented in similar proportions for all polymorphic Alu elements(36), those Alu elements that map to GWAS signals, and the Aluvariants in LD with TASs (Fig. S3).One of the strongest GWAS signals we associated with an Alu

insertion variant is at 2p25.3 (Fig. 5A). This region is an im-portant genetic determinant for weight, obesity, and body massindex (best P = 3 × 10−49) (e.g., ref. 37). Although the effect hasbeen attributed to the transmembrane protein 18 (TMEM18), atranscriptional repressor that sequesters sequences at the mem-brane, there is debate concerning whether TMEM18 is involvedin obesity regulation through expression in adipose tissue or thecentral nervous system (38). The causative variant(s) remainunclear even after targeted resequencing efforts that focusedmainly on the TMEM18 coding sequence (39–41); it is presumedthat the risk haplotype contains a noncoding regulatory variant.Seven TASs occur over a 23.5-kb interval, mapping ∼15–70 kbdownstream of TMEM18 (e.g., refs. 37, 42–46). The risk haplo-type is defined by TASs rs2867125-C, rs6711012-C, rs2903492-A,rs12463617-C, rs6548238-C, rs7561317-G, and rs10189761-A,with an overall allele frequency of ∼0.87 in the GWAS pop-ulations. We found a 306-bp polymorphic AluYa5 that maps

AEmpty 0.237

Haplotypes

LD supported

ROS1 DCBLD1Variants

AluAAluC

Haplotypes

0.2670.273

*CEmpty 0.223

*

10 kb No LD

r2=0.01D’=0.11

10 kb

r2=0.50D’=0.84

SLC2A9SLC2A9

VariantsEmpty G 0.340

CEmpty 0.170 *CAlu 0.490

AAlu

GG

EmptyA 0.472

0.528*r2=1.00D’=1.00

6p22.1

4p16.1

17q23.3 ACE

ACEKCNH6KCNH6

CYB561 CYB561

SINEs

SNPs

10 kb

Complete LD

Polymorphic Alu elements

*

A

B

C

Fig. 3. Genetic relationships between TASs and polymorphic Alu elementsdistinguish functional variant candidates. (A) There is no LD between thepolymorphic Alu element and the lung cancer TAS, rs9387478 (P = 10−10) at6p22.1. The Alu element is just as frequently found with the risk haplotype (*)as with the protective haplotype. (B) Moderate LD between a polymorphic Aluelement and a TAS associated with urate levels (P = 3 × 10−9) makes thisvariant a potential functional variant. Although the MAF differs between theAlu variant and the TAS, when present, Alu is consistently on the major hap-lotype strand. (C) Good functional variant candidate. There are many SINEsacross the locus, but only one polymorphic Alu (red) has been identified at thislocus; the polymorphic Alu element occurs in ACE. The LD structure is shownand generated by pairwise comparison between variants (SNPs and Alu ele-ment), where red indicates LD. The GWAS LD block associated with the TASs(blue) and Alu variant (red) is bracketed and shown by a red horizontal line.There is complete LD (r2 = 1) between the Alu element and rs4343, the SNPassociated with human serum ACE levels in a GWAS (P = 3 × 10−25). The emptyallele is on the risk (*) haplotype.

[Fig. 4 graphic. Scatter plot; x axis: chromosome (1–21); y axis: correlation coefficient for TAS and polymorphic Alu (0.00–1.00). Annotations: strong LD, 44 different Alu variants; moderate LD, 18 different Alu variants.]

Fig. 4. Forty-four Alu variants are in strong LD with TAS(s). LD results for all pairwise comparisons between polymorphic Alu elements and their corresponding TASs mapped across the genome are shown. Strong LD (r^2 > 0.7) indicates the best functional candidates, and falls above the red line. There are 44 of these insertions, also shown in Table 1. We also defined a subset of polymorphic Alu elements imperfectly correlated with nearby TASs (0.4 < r^2 < 0.7; n = 16) (above the gray line).

Payer et al. PNAS | Published online May 2, 2017 | E3987


between the TASs and TMEM18 and is antisense relative to TMEM18. We determined the allele frequency of the Alu insertion to be 0.12. It is in perfect (r^2 = 1, D′ = 1) or near-perfect (r^2 = 0.907) LD with each of the TASs and corresponds to the protective haplotype (Fig. 5A). Therefore, the polymorphic Alu element or a variant traveling on the same strand reduces obesity risk by an unknown mechanism.

We observed a similar scenario at a meningococcal disease risk locus on 1q31.3 (P = 5 × 10^-13) (47) (Fig. 5B). The causative bacterium, Neisseria meningitidis, colonizes a portion of the population asymptomatically, but can cause sepsis and meningitis with high mortality rates in susceptible individuals (e.g., refs. 48, 49). Genes at 1q31.3 are involved in complement activation, a pathway that has been associated with meningococcal disease susceptibility (e.g., refs. 50, 51). However, the causative variant at this locus is unknown. The TAS (47) maps to an intron of complement factor H-related 3 (CFHR3) and ∼27 kb upstream of the transcription start site for CFHR1 (Fig. 5B). We identified a 314-bp AluYb8 on the same strand as CFHR3, located 953 bp from the TAS. The TAS rs426736 and Alu element are in complete LD (r^2 = 1, D′ = 1). The TAS risk allele, T, and the Alu-containing allele are part of the same haplotype; the Alu insertion is on the risk haplotype.
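To make the LD statistics quoted for these loci concrete, the following is a minimal sketch of how D′ and r^2 can be computed from phased two-locus haplotype frequencies. It is not the authors' pipeline (their procedure is described in SI Materials and Methods); the function name `ld_from_haplotypes` and the dictionary layout are illustrative assumptions, and the example frequencies are the two haplotypes displayed for 1q31.3 in Fig. 5B (T with Alu, 0.17; G with empty site, 0.83).

```python
def ld_from_haplotypes(hap_freqs):
    """Return (D, D_prime, r2) for two biallelic loci.

    hap_freqs maps (allele_at_locus_1, allele_at_locus_2) -> frequency.
    Frequencies are renormalized so that they sum to 1.
    """
    total = sum(hap_freqs.values())
    freqs = {hap: f / total for hap, f in hap_freqs.items()}

    # Use the alleles of the first listed haplotype as the reference alleles.
    a_ref, b_ref = next(iter(freqs))
    p_a = sum(f for (a, _), f in freqs.items() if a == a_ref)
    p_b = sum(f for (_, b), f in freqs.items() if b == b_ref)
    p_ab = freqs.get((a_ref, b_ref), 0.0)

    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    r2 = d * d / denom

    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    return d, d_prime, r2


# Haplotypes as displayed in Fig. 5B (illustrative use only).
fig_5b = {("T", "Alu"): 0.17, ("G", "Empty"): 0.83}
d, d_prime, r2 = ld_from_haplotypes(fig_5b)
print(f"D' = {d_prime:.2f}, r2 = {r2:.2f}")
```

Because only two of the four possible haplotypes are observed, the Alu allele and the T allele always travel together, which is what complete LD (r^2 = 1, D′ = 1) means.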

Imputed Alu Variants Are Highly Associated with Disease Risk. To confirm that Alu tagged by TASs associate with disease phenotypes, we used available individual-level genotyping data from

Table 1. Polymorphic Alu elements in strong LD (r^2 > 0.7) with GWAS TAS

Region    Disease or trait                   SNP         P value  OR     r^2
1p13.1    Multiple sclerosis*                rs1335532   3E-16    1.22   1.000
1q23.1    Red blood cell traits              rs857684    4E-16    -      0.811
1q31.3    Meningococcal disease              rs426736    5E-13    1.59   1.000
2p16.1    Venous thromboembolism             rs1367228   2E-09    1.49   0.769
2p25.3    Obesity*                           rs2867125   3E-49    -      1.000
2q24.2    Bilirubin levels*                  rs2667011   2E-13    -      0.804
2q33.1    Crohn’s disease                    rs6738825   4E-09    1.06   0.920
2q35      Breast cancer                      rs16857609  1E-15    1.08   0.953
3p21.1    Osteoarthritis                     rs11177     5E-09    1.09   0.914
          Major mood disorders               rs2251219   2E-09    1.14   0.957
3q21.3    Monocyte count                     rs2712381   2E-16    -      0.877
3q25.32   Height                             rs2362965   2E-09    1.12   1.000
3q28      Alzheimer’s disease biomarkers     rs9877502   5E-09    -      1.000
4q25      Myopia (pathological)              rs10034228  8E-13    1.23   0.959
          Metabolic traits                   rs2087160   7E-13    -      0.723
          Blood pressure                     rs6825911   9E-09    -      0.839
5p15.33   Myocardial infarction              rs11748327  5E-13    1.25   0.816
6p21.1    Metabolic traits                   rs9472155   2E-26    -      0.945
6p22.2    Platelet counts                    rs441460    9E-18    3.08   0.961
6p23      Breast cancer                      rs204247    8E-09    1.05   1.000
6q16.1    Migraine*                          rs11759769  2E-12    1.18   0.887
7q22.1    Ulcerative colitis                 rs7809799   9E-11    1.56   1.000
7q31.31   Bone mineral density               rs4609139   1E-10    -      0.852
8q23.3    HDL cholesterol level              rs2293889   6E-11    -      0.795
8q24.21   Breast cancer*                     rs13281615  1E-27    1.09   0.765
          Prostate cancer*                   rs10505483  7E-15    1.73   1.000
9q22.32   Height                             rs10512248  4E-11    -      0.957
10p11.23  Dental caries                      rs399593    9E-09    -      1.000
10q21.2   ALL*                               rs10821936  6E-46    1.86   0.830
12p11.22  Breast cancer*                     rs10771399  8E-31    1.16   0.918
          Height                             rs2638953   7E-17    -      0.887
12q13.3   Psoriasis                          rs2066808   1E-09    1.34   1.000
          Height                             rs2066807   1E-13    -      1.000
12q21.2   Myopia (pathological)              rs17788937  4E-15    -      0.829
13q22.3   Hair color                         rs975739    2E-14    -      0.874
14q32.2   Type 1 diabetes                    rs4900384   4E-09    1.09   0.776
          Graves’ disease                    rs1456988   5E-09    1.12   0.822
15q21.2   Thyroid hormone levels*            rs10519227  1E-11    -      0.763
15q22.2   Height                             rs7178424   6E-09    -      0.781
15q24.1   Liver enzyme levels                rs8038465   1E-09    2.40   0.825
16p13.13  Age at menopause                   rs10852344  1E-11    -      0.752
16q22.1   Coronary heart disease             rs3729639   2E-11    -      1.000
17q23.3   Height*                            rs2665838   5E-25    -      0.916
          Metabolic traits* (including ACE)  rs4343      3E-25    16.20  1.000
17q25.3   Eye color traits                   rs9894429   9E-14    -      0.813
20q13.32  Blood pressure*                    rs6015450   4E-23    -      1.000

ORs are given when they were reported in the GWAS catalog (75). Dashes indicate values not reported. A blank Region cell marks an additional trait listed under the preceding region.
*When there are multiple reports of the same phenotype at a locus, the table includes the strongest GWAS signal.

E3988 | www.pnas.org/cgi/doi/10.1073/pnas.1704117114 Payer et al.


two cancer GWASs. We imputed the Alu genotype of each study participant, and tested the genotype–phenotype association. For these two cancers, precursor B-cell (pre-B) ALL and prostate cancer, we could more directly ask how the Alu element behaves in relation to disease risk. Specifically, does the presence or absence of the Alu occur disproportionately in patients relative to controls?

Pre-B ALL is the most common childhood cancer. The inherited risk of this disease has been mapped to several loci, including the ARID5B locus (P < 10^-19) (52–54). ARID5B encodes the AT-rich interaction domain (ARID) 5B (MRF1-like) protein that is a transcriptional regulator highly expressed in developing B cells. The TAS LD block is limited to an ∼34-kb region of the ARID5B gene, but the causative variant mapping to this region has not been identified. We identified a 168-bp antisense polymorphic AluYb8 element in the third intron of ARID5B (Fig. 5C). The Alu is in strong LD with previously identified TASs (rs7089424 and rs10821936; r^2 = 0.83, D′ = 1) and is on the risk haplotype (Fig. 5C). We obtained genotyping data for patients with pre-B ALL (54) and controls from the Framingham cohort (55). We imputed the Alu genotype for patients and controls and performed the association analysis. SNPs at the 5′ end of the gene are not associated with disease risk, whereas SNPs ∼20 kb upstream of the Alu are associated with disease, and the relationship with disease risk deteriorates quickly downstream. The Alu variant is at the pinnacle of the Manhattan plot, indicating that it is highly associated with disease risk and compares favorably with other previously reported TASs.

We also obtained individual-level genotyping data from prostate cancer GWAS samples to investigate the signal at 8q24 (Fig. 5D). This region has long been implicated in epithelial cancer risk, with the GWAS results for prostate, breast, ovarian, and colorectal cancer mapping to five separate GWAS peaks (e.g., refs. 56–63). Four of these peaks are associated with prostate cancer (Fig. S4). Causative genes at this locus are unknown. Although several long intergenic noncoding RNAs at this locus may be up-regulated in prostate cancer (64), much emphasis has remained on this region regulating protein-coding genes (e.g., refs. 65, 66). To appreciate fully the mechanism that alters disease risk, functional variant(s) associated with each of these four LD blocks must be identified. At one of these regions, we found a 301-bp polymorphic AluYb8 element that is in perfect LD (r^2 = 1) with the TASs (57, 67, 68) that mark this location (Fig. 5D). The Alu has an allele frequency of 0.03 and is on the risk haplotype.

[Fig. 5 graphic. Panels: (A) 2p25.3, TMEM18 (r^2 = 1.00 and 0.91 between the Alu and TASs); (B) 1q31.3, CFH/CFHR3/CFHR1 (r^2 = 1.00); (C) 10q21.2, ARID5B (r^2 = 0.83); (D) 8q24.21, including PCAT1, PCAT2, PRNCR1, CASC8, CASC19, CASC21, CCAT1, CCAT2, and POU5F1B (r^2 = 1.00). Each panel has a 10-kb scale bar and phased haplotype frequencies, with * marking the risk haplotype. Association plots in C and D: x axis, chromosome position (63.2–63.6 Mb in C; 127.85–128.51 Mb in D); y axis, log disease association.]

Fig. 5. Loci where polymorphic Alu elements are potential causative variants. LD plots show Alu insertions and neighboring SNPs with pairwise comparisons indicating variants in LD (red). (A) The 2p25.3 locus is associated with obesity (best P = 3 × 10^-49) (37). The Alu insertion variant (red) and TASs (blue) are annotated within the LD block (red horizontal line) downstream of the TMEM18 gene. The r^2 values between the Alu and TASs are shown to the lower left; phased haplotypes are shown to the lower right. Here, the preinsertion (empty) allele is the risk allele (*), and the Alu insertion segregates with the protective haplotype. (B) The 1q31.3 locus associated with meningococcal disease (P = 5 × 10^-13) (47); the Alu is on the risk haplotype at CFHR3. (C) The 10q21.2 locus for precursor B-cell ALL; the Alu is on the risk haplotype. (D) The 8q24 locus for prostate cancer; the Alu is on the risk haplotype. For C and D, we imputed the Alu variant genotype for patients in the GWAS and controls to test the association between Alu genotype and disease. Graphs show the −log of the P value for disease association on the y axis; the genomic coordinate is plotted on the x axis. The polymorphic Alu in each case is highly associated with the disease (red diamond), comparable to proximal TASs (blue triangles).


Although we were unable to access all study components of the metaanalysis, we were able to recapitulate the signal to a lower significance level with publicly available data. As expected, given the complete LD between the Alu element and the TASs, the Alu and TASs localize to the peak of the Manhattan plot (Fig. 5D). The Alu is therefore as good a candidate as the previously identified SNP variants and remains a candidate for the functional variant at this locus.
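The question posed in this section, whether the Alu allele occurs disproportionately in patients relative to controls, can be illustrated with a simple allelic test. The text does not state which test statistic was used, so the sketch below swaps in a plain Pearson chi-square test (1 df) on a 2 × 2 table of allele counts; the function name `allelic_association` and all counts are hypothetical and are not study data. The −log10 of the resulting P value is the quantity plotted on the y axis of a Manhattan plot.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (odds_ratio, chi2, p_value). All four counts must be nonzero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper tail of chi-square equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return odds_ratio, chi2, p_value


# Hypothetical allele counts: Alu vs. empty alleles in cases and controls.
odds_ratio, chi2, p_value = allelic_association(60, 40, 40, 60)
print(odds_ratio, chi2, -math.log10(p_value))
```

A real analysis of imputed genotypes would normally account for imputation uncertainty and covariates such as ancestry, which this sketch leaves out.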

Discussion

Although polymorphic interspersed repeats are a major source of structural variation in the genome, the functional effects of these sequences have not been systematically investigated. We leveraged the accelerated polymorphic retrotransposon discovery of recent years (e.g., refs. 9, 10, 21, 22, 36) to identify common polymorphic Alu elements that may alter disease risk, albeit with the modest effect sizes identified in GWAS. We focused on polymorphic elements near GWAS signals, where functional variants remain elusive. We provide an important resource of all reported polymorphic Alu elements mapping in these intervals (Dataset S3).

We identified numerous loci where the polymorphic Alu element is a candidate causative variant by association. Specifically, we found 44 Alu insertion polymorphisms in strong LD (r^2 > 0.7) with the SNP(s) most associated with a disease phenotype (Table 1 and Dataset S3). Thus, at these loci, we find genetic evidence to associate these Alu variants with functional effects. Although Alu insertions are well recognized to cause genetic disease as rare mutations interrupting coding exons or disrupting splicing, our study indicates that they may also regularly operate as common variants affecting risk for common diseases. Only two common Alu insertion alleles have been previously implicated in human phenotypes (9, 32–34). We now report 20-fold more candidates.

Are these Alu insertions truly the functional variant at each of these loci? We do not want to overstate the functional role of any specific Alu insertion identified in this study. Experimental systems relevant to model effects detected in GWAS are challenging to develop and should rigorously assess multiple candidate causative variants at each locus after extensive variant discovery and phenotype association fine-mapping. We make the case that these efforts should be designed to consider Alu insertion variants. Alu variants may be inherently more likely than a typical SNP to have a functional consequence, given that each insertion creates a structural feature of about 300 bp. Perhaps the strongest evidence that Alu polymorphisms deserve special consideration as causal variants is the disproportionate co-occurrence of Alu variants at GWAS loci (P = 0.013). In total, we identified an unexpectedly high number (n = 625) of Alu variants mapping within TAS LD blocks (GWAS, P < 10^-9) (Fig. 2). This enrichment of polymorphic Alu elements suggests that some are likely functional.

Further studies will be required to determine the extent of polymorphic Alu involvement in GWAS signals and the functional mechanism(s) responsible. De novo Alu insertions cause single-gene disease by interrupting coding sequences or disrupting splicing (8). For polymorphic Alu elements acting as common variants, molecular effects are expected to be more subtle and have less pronounced phenotypic consequence. Most of the Alu insertions in our study do not map to known coding or regulatory sequences or highly conserved sequences. As with GWAS intervals generally, the location of these variants does not immediately imply a mechanism. Interestingly, mobile genetic elements may themselves carry regulatory sequences and distribute these sequences in the genome as “plug-and-play” gene regulators (e.g., refs. 69, 70). Although this model has not been proven for polymorphic Alu elements, it is supported for evolutionarily older elements that are fixed in the genome, including Alu elements that act as tissue-specific enhancers (e.g., refs. 71, 72). Fixed Alu can also provide alternatively used exons (e.g., refs. 73, 74).

The 44 potentially causal polymorphic Alu elements we describe here likely underrepresent the number of transposable element insertions in this category. Although we have assembled the most inclusive list of common polymorphic Alu elements to date, many more common variants are predicted to exist than have been mapped and reported (17, 18). Also, our phasing of these Alu insertions with surrounding SNPs to discern haplotypes focused on common Alu insertion alleles in the European population. Inclusion of more diverse human populations in polymorphism discovery and targeted discoveries in patient populations will likely increase the number of candidate functional variants.

Materials and Methods

The Johns Hopkins University School of Medicine Institutional Review Board reviewed and approved our application for Framingham Heart Study data access (NA 00092855), and access was approved by the study data access committee.

Transposable Elements Mapping to GWAS Signals. TIP-chip (23) was used to map insertions (SI Materials and Methods). Previously reported polymorphic Alu elements were collected (10, 11, 21, 22, 24–27). The LD block for each GWAS signal (P ≤ 10^-8) was defined by proxy SNPs to the TAS (r^2 > 0.8) (SI Materials and Methods). Overlaps are reported in Dataset S3.
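As a rough illustration of the overlap step, the sketch below treats an LD block as the interval spanned by a TAS and its proxy SNPs with r^2 > 0.8 and then reports which Alu insertion sites fall inside it. Taking the span of the proxies as the block boundaries is an assumption made here for illustration (the exact definition is in SI Materials and Methods), and the function names and coordinates are hypothetical.

```python
def ld_block(tas_pos, proxies, r2_min=0.8):
    """Return (start, end) of the interval spanned by a TAS and its proxy SNPs.

    proxies is an iterable of (position, r2_with_tas) pairs on the same
    chromosome; only proxies with r2 above r2_min extend the block.
    """
    positions = [tas_pos] + [pos for pos, r2 in proxies if r2 > r2_min]
    return min(positions), max(positions)


def alus_in_block(block, alu_positions):
    """Return the Alu insertion positions that lie within the block (inclusive)."""
    start, end = block
    return [pos for pos in alu_positions if start <= pos <= end]


# Hypothetical coordinates on a single chromosome.
block = ld_block(1_000, [(900, 0.95), (1_500, 0.85), (2_000, 0.50)])
print(block, alus_in_block(block, [850, 1_200, 1_600]))
```

Genome-wide, the same comparison is usually done with an interval-intersection tool rather than a per-block loop; the loop is kept here only for readability.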

Enrichment of polymorphic Alu elements near GWAS signals was calculated by comparing the number of overlaps between polymorphic Alu elements and GWAS LD blocks with the number of overlaps between polymorphic Alu elements and each of 1,000 randomized sets of LD blocks. All reported GWAS signals (P ≤ 10^-9) were reduced to a list of 3,242 nonoverlapping TAS LD blocks. To generate 1,000 sets of random LD blocks (3,242 blocks per set) with similar characteristics to these GWAS LD blocks, the characteristics of the TASs anchoring GWAS LD blocks were used to select random SNPs. Specifically, the random SNPs had allele frequencies (within bins of 5%) and distances to the nearest gene (within gene, <10 kb, or >10 kb) matching those parameters of the TAS. These sets of 3,242 random SNPs were used to define 1,000 new sets of LD blocks (3,242 LD blocks per iteration). We recorded the number of times that known polymorphic Alu elements map to each set of 3,242 random LD blocks and compared the observed value (the number of times that the same elements map to TAS LD blocks) with these expected values. Polymorphic Alu elements and LD blocks mapping to the sex chromosomes and HLA locus were excluded to eliminate any bias owing to unequal ascertainment at these regions or the large intervals of LD at HLA.
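The randomization just described can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the record fields (`freq`, `gene_dist`), the function names, and the handling of a SNP exactly 10 kb from a gene are hypothetical, and the one-sided empirical P value with a +1 correction is a common convention that the text does not confirm.

```python
import random
from collections import defaultdict


def match_key(allele_freq, gene_distance_bp):
    """Bin a SNP by allele frequency (5% bins) and distance to the nearest gene."""
    # Round to whole percent first so that values such as 0.15 are not
    # pushed into the wrong bin by floating-point error.
    freq_bin = min(int(round(allele_freq * 100)) // 5, 19)
    if gene_distance_bp == 0:
        distance_class = "within gene"
    elif gene_distance_bp < 10_000:
        distance_class = "<10 kb"
    else:
        distance_class = ">10 kb"  # exactly 10 kb is placed here by assumption
    return freq_bin, distance_class


def sample_matched_snps(tas_list, snp_pool, rng):
    """Draw one random SNP per TAS from the pool entries sharing its match key."""
    by_key = defaultdict(list)
    for snp in snp_pool:
        by_key[match_key(snp["freq"], snp["gene_dist"])].append(snp)
    # rng.choice raises IndexError if no pool SNP matches a given TAS.
    return [rng.choice(by_key[match_key(t["freq"], t["gene_dist"])]) for t in tas_list]


def empirical_p(observed, null_counts):
    """One-sided empirical P value: share of random sets at least as extreme."""
    as_extreme = sum(1 for count in null_counts if count >= observed)
    return (as_extreme + 1) / (len(null_counts) + 1)
```

In use, `sample_matched_snps` would be called 1,000 times, each draw would be converted to LD blocks and intersected with the Alu catalog, and the resulting overlap counts would be passed to `empirical_p` together with the observed count.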

Genotyping of Alu Elements and LD Analysis. To focus on Alu elements that are common polymorphisms, we conducted an initial screen in pooled DNA samples. Detected polymorphic Alu elements were genotyped in a 30-trio reference panel of CEU HapMap samples by PCR (SI Materials and Methods). LD between the Alu variant and TAS was defined as r^2 ≥ 0.4 and D′ ≥ 0.8, with particular emphasis placed on those variants in stronger LD with r^2 ≥ 0.7 (SI Materials and Methods and Dataset S3).
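The thresholds in this paragraph amount to a small decision rule, sketched below. The function name `classify_ld` and the labels "strong", "moderate", and "none" are illustrative choices that echo the wording of Figs. 3 and 4; the example values are the r^2 and D′ pairs displayed in the three panels of Fig. 3.

```python
def classify_ld(r2, d_prime):
    """Classify an Alu/TAS pair using the thresholds stated in the Methods."""
    in_ld = r2 >= 0.4 and d_prime >= 0.8
    if in_ld and r2 >= 0.7:
        return "strong"
    if in_ld:
        return "moderate"
    return "none"


# (r2, D') pairs as displayed in Fig. 3, panels A to C.
for panel, (r2, d_prime) in {"A": (0.01, 0.11), "B": (0.50, 0.84), "C": (1.00, 1.00)}.items():
    print(panel, classify_ld(r2, d_prime))
```

Requiring D′ ≥ 0.8 alongside r^2 keeps pairs such as Fig. 3B, where allele frequencies differ but the Alu sits consistently on one haplotype, while excluding pairs with no consistent phase.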

ACKNOWLEDGMENTS. We thank Drs. Wenjian Yang and Mary Relling (St. Jude Children’s Research Hospital) for genotyping data; Mary Relling, John Moran (University of Michigan), David Valle, Aravinda Chakravarti, Haig Kazazian, and members of the K.H.B. laboratory for helpful discussion of this project and review of the manuscript; and Tim Babatz, Emily Robinson, Allison Moyer, Hannah Bogen, Nicholas Frisco, Reona Kimura, and Tianqi (Nina) Luo for technical assistance. This work was funded by National Heart, Lung, and Blood Institute Grant T32HL007525; a Burroughs Wellcome Fund Career Award for Biomedical Scientists Program (to K.H.B.); US NIH Awards R01CA163705 (to K.H.B.) and R01GM103999 (to K.H.B.), as well as Center for Systems Biology of Retrotransposition Grant P50GM107632 (to K.H.B. and J.D.B.).

1. Lander ES, et al.; International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921.
2. Kellis M, et al. (2014) Defining functional DNA elements in the human genome. Proc Natl Acad Sci USA 111:6131–6138.
3. Smit AFA, Hubley R, Green P (2015) RepeatMasker Open-4.0. Available at www.repeatmasker.org. Accessed April 26, 2017.
4. Kazazian HH, Jr, et al. (1988) Haemophilia A resulting from de novo insertion of L1 sequences represents a novel mechanism for mutation in man. Nature 332:164–166.


5. Sukarova E, Dimovski AJ, Tchacarova P, Petkov GH, Efremov GD (2001) An Alu insert as the cause of a severe form of hemophilia A. Acta Haematol 106:126–129.
6. Ganguly A, Dunbar T, Chen P, Godmilow L, Ganguly T (2003) Exon skipping caused by an intronic insertion of a young Alu Yb9 element leads to severe hemophilia A. Hum Genet 113:348–352.
7. Green PM, Bagnall RD, Waseem NH, Giannelli F (2008) Haemophilia A mutations in the UK: Results of screening one-third of the population. Br J Haematol 143:115–128.
8. Hancks DC, Kazazian HH, Jr (2016) Roles for retrotransposon insertions in human disease. Mob DNA 7:9.
9. Sudmant PH, et al.; 1000 Genomes Project Consortium (2015) An integrated map of structural variation in 2,504 human genomes. Nature 526:75–81.
10. Hormozdiari F, et al. (2011) Alu repeat discovery and characterization within human genomes. Genome Res 21:840–849.
11. Stewart C, et al.; 1000 Genomes Project (2011) A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet 7:e1002236.
12. Mathias SL, Scott AF, Kazazian HH, Jr, Boeke JD, Gabriel A (1991) Reverse transcriptase encoded by a human transposable element. Science 254:1808–1810.
13. Moran JV, et al. (1996) High frequency retrotransposition in cultured mammalian cells. Cell 87:917–927.
14. Dewannieux M, Esnault C, Heidmann T (2003) LINE-mediated retrotransposition of marked Alu sequences. Nat Genet 35:41–48.
15. Raiz J, et al. (2012) The non-autonomous retrotransposon SVA is trans-mobilized by the human LINE-1 protein machinery. Nucleic Acids Res 40:1666–1683.
16. Hancks DC, Goodier JL, Mandal PK, Cheung LE, Kazazian HH, Jr (2011) Retrotransposition of marked SVA elements by human L1s in cultured cells. Hum Mol Genet 20:3386–3400.
17. Ewing AD, Kazazian HH, Jr (2010) High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res 20:1262–1270.
18. Watterson GA (1975) On the number of segregating sites in genetical models without recombination. Theor Popul Biol 7:256–276.
19. Deininger PL, Moran JV, Batzer MA, Kazazian HH, Jr (2003) Mobile elements and mammalian genome evolution. Curr Opin Genet Dev 13:651–658.
20. Xing J, et al. (2009) Mobile elements create structural variation: Analysis of a complete human genome. Genome Res 19:1516–1526.
21. Wang J, et al. (2006) dbRIP: A highly integrated database of retrotransposon insertion polymorphisms in humans. Hum Mutat 27:323–329.
22. Witherspoon DJ, et al. (2013) Mobile element scanning (ME-Scan) identifies thousands of novel Alu insertions in diverse human populations. Genome Res 23:1170–1181.
23. Huang CR, et al. (2010) Mobile interspersed repeats are major structural variants in the human genome. Cell 141:1171–1182.
24. Shukla R, et al. (2013) Endogenous retrotransposition activates oncogenic pathways in hepatocellular carcinoma. Cell 153:101–111.
25. Lee E, et al.; Cancer Genome Atlas Research Network (2012) Landscape of somatic retrotransposition in human cancers. Science 337:967–971.
26. Witherspoon DJ, et al. (2010) Mobile element scanning (ME-Scan) by targeted high-throughput sequencing. BMC Genomics 11:410.
27. Iskow RC, et al. (2010) Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 141:1253–1261.
28. Witherspoon DJ, et al. (2009) Alu repeats increase local recombination rates. BMC Genomics 10:530.
29. Frazer KA, et al.; International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851–861.
30. Naj AC, et al. (2013) Genetic factors in nonsmokers with age-related macular degeneration revealed through genome-wide gene-environment interaction analysis. Ann Hum Genet 77:215–231.
31. Charles BA, et al. (2011) A genome-wide association study of serum uric acid in African Americans. BMC Med Genomics 4:17.
32. Rigat B, et al. (1990) An insertion/deletion polymorphism in the angiotensin I-converting enzyme gene accounting for half the variance of serum enzyme levels. J Clin Invest 86:1343–1346.
33. Tiret L, et al. (1992) Evidence, from combined segregation and linkage analysis, that a variant of the angiotensin I-converting enzyme (ACE) gene controls plasma ACE levels. Am J Hum Genet 51:197–205.
34. Chung CM, et al. (2010) A genome-wide association study identifies new loci for ACE activity: Potential implications for response to ACE inhibitor. Pharmacogenomics J 10:537–544.
35. Hindorff LA, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106:9362–9367.
36. Konkel MK, et al.; 1000 Genomes Consortium (2015) Sequence analysis and characterization of active human Alu subfamilies based on the 1000 Genomes Pilot Project. Genome Biol Evol 7:2608–2622.
37. Speliotes EK, et al.; MAGIC; Procardis Consortium (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42:937–948.
38. Speakman JR (2013) Functional analysis of seven genes linked to body mass index and adiposity by genome-wide association studies: A review. Hum Hered 75:57–79.
39. Volckmar AL, et al. (2016) Analysis of genes involved in body weight regulation by targeted re-sequencing. PLoS One 11:e0147904.
40. Rask-Andersen M, et al. (2015) Determination of obesity associated gene variants related to TMEM18 through ultra-deep targeted re-sequencing in a case-control cohort for pediatric obesity. Genet Res 97:e16.
41. Liu CT, et al. (2014) Sequence variation in TMEM18 in association with body mass index: Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium Targeted Sequencing Study. Circ Cardiovasc Genet 7:344–349.
42. Berndt SI, et al. (2013) Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat Genet 45:501–512.
43. Graff M, et al.; GIANT Consortium (2013) Genome-wide analysis of BMI in adolescents and young adults reveals additional insight into the effects of genetic loci over the life course. Hum Mol Genet 22:3597–3607.
44. Wheeler E, et al. (2013) Genome-wide SNP and CNV analysis identifies common and low-frequency variants associated with severe early-onset obesity. Nat Genet 45:513–517.
45. Willer CJ, et al.; Wellcome Trust Case Control Consortium; Genetic Investigation of Anthropometric Traits Consortium (2009) Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet 41:25–34.
46. Thorleifsson G, et al. (2009) Genome-wide association yields new sequence variants at seven loci that associate with measures of obesity. Nat Genet 41:18–24.
47. Davila S, et al.; International Meningococcal Genetics Consortium (2010) Genome-wide association study identifies variants in the CFH region associated with host susceptibility to meningococcal disease. Nat Genet 42:772–776.
48. Emonts M, Hazelzet JA, de Groot R, Hermans PW (2003) Host genetic determinants of Neisseria meningitidis infections. Lancet Infect Dis 3:565–577.
49. Haralambous E, et al. (2003) Sibling familial risk ratio of meningococcal disease in UK Caucasians. Epidemiol Infect 130:413–418.
50. Schneider MC, et al. (2009) Neisseria meningitidis recruits factor H using protein mimicry of host carbohydrates. Nature 458:890–893.
51. Brouwer MC, et al. (2009) Host genetic susceptibility to pneumococcal and meningococcal disease: A systematic review and meta-analysis. Lancet Infect Dis 9:31–44.
52. Xu H, et al. (2013) Novel susceptibility variants at 10p12.31-12.2 for childhood acute lymphoblastic leukemia in ethnically diverse populations. J Natl Cancer Inst 105:733–742.
53. Papaemmanuil E, et al. (2009) Loci on 7p12.2, 10q21.2 and 14q11.2 are associated with risk of childhood acute lymphoblastic leukemia. Nat Genet 41:1006–1010.
54. Treviño LR, et al. (2009) Germline genomic variants associated with childhood acute lymphoblastic leukemia. Nat Genet 41:1001–1005.
55. Dawber TR, Meadors GF, Moore FE, Jr (1951) Epidemiological approaches to heart disease: The Framingham Study. Am J Public Health Nations Health 41:279–281.
56. Easton DF, et al.; SEARCH collaborators; kConFab; AOCS Management Group (2007) Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447:1087–1093.
57. Gudmundsson J, et al. (2007) Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nat Genet 39:631–637.
58. Haiman CA, et al. (2007) Multiple regions within 8q24 independently affect risk for prostate cancer. Nat Genet 39:638–644.
59. Haiman CA, et al. (2007) A common genetic risk factor for colorectal and prostate cancer. Nat Genet 39:954–956.
60. Schumacher FR, et al. (2007) A common 8q24 variant in prostate and breast cancer from a large nested case-control study. Cancer Res 67:2951–2956.
61. Tomlinson IP, et al.; CORGI Consortium; EPICOLON Consortium (2008) A genome-wide association study identifies colorectal cancer susceptibility loci on chromosomes 10p14 and 8q23.3. Nat Genet 40:623–630.
62. Yeager M, et al. (2007) Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet 39:645–649.
63. Ghoussaini M, et al.; UK Genetic Prostate Cancer Study Collaborators/British Association of Urological Surgeons’ Section of Oncology; UK Protect Study Collaborators (2008) Multiple loci with different cancer specificities within the 8q24 gene desert. J Natl Cancer Inst 100:962–966.
64. Bawa P, et al. (2015) Integrative analysis of normal long intergenic non-coding RNAs in prostate cancer. PLoS One 10:e0122143.
65. Pomerantz MM, et al. (2009) The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat Genet 41:882–884.
66. Meyer KB, et al. (2011) A functional variant at a prostate cancer predisposition locus at 8q24 is associated with PVT1 expression. PLoS Genet 7:e1002165.
67. Gudmundsson J, et al. (2009) Genome-wide association and replication studies identify four variants associated with prostate cancer susceptibility. Nat Genet 41:1122–1126.
68. Cheng I, et al. (2012) Evaluating genetic risk for prostate cancer among Japanese and Latinos. Cancer Epidemiol Biomarkers Prev 21:2048–2058.
69. Feschotte C (2008) Transposable elements and the evolution of regulatory networks. Nat Rev Genet 9:397–405.
70. Sundaram V, et al. (2014) Widespread contribution of transposable elements to the innovation of gene regulatory networks. Genome Res 24:1963–1976.
71. Jacobsen BM, Jambal P, Schittone SA, Horwitz KB (2009) ALU repeats in promoters are position-dependent co-response elements (coRE) that enhance or repress transcription by dimeric and monomeric progesterone receptors. Mol Endocrinol 23:989–1000.
72. Romanish MT, Nakamura H, Lai CB, Wang Y, Mager DL (2009) A novel protein isoform of the multicopy human NAIP gene derives from intragenic Alu SINE promoters. PLoS One 4:e5761.
73. Gal-Mark N, Schwartz S, Ast G (2008) Alternative splicing of Alu exons–two arms are better than one. Nucleic Acids Res 36:2012–2023.
74. Sorek R, Ast G, Graur D (2002) Alu-containing exons are alternatively spliced. Genome Res 12:1060–1067.
75. Welter D, et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42:D1001–D1006.


76. Carroll ML, et al. (2001) Large-scale analysis of the Alu Ya5 and Yb8 subfamilies and their contribution to human genomic diversity. J Mol Biol 311:17–40.
77. Johnson AD, et al. (2008) SNAP: A web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24:2938–2939.
78. Afgan E, et al. (2016) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res 44:W3–W10.
79. Pruitt KD, et al. (2014) RefSeq: An update on mammalian reference sequences. Nucleic Acids Res 42:D756–D763.
80. Quinlan AR, Hall IM (2010) BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842.
81. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Statist Soc B 57:289–300.
82. Barrett JC, Fry B, Maller J, Daly MJ (2005) Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics 21:263–265.
83. Purcell S, et al. (2007) PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–575.
84. Delaneau O, Marchini J, Zagury JF (2011) A linear complexity phasing method for thousands of genomes. Nat Methods 9:179–181.
85. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5:e1000529.
