The Ashkenazi Genome Project
Shai Carmi
Pe’er lab, Columbia University
and The Ashkenazi Genome Consortium (TAGC)
Boston, September 2013
Outline
• Ashkenazi Jewish (AJ) Genetics and TAGC
• Basic Variant Statistics
• Utility in AJ Medical Genetics
• Demographic History of AJ and Europeans
• Summary
Ashkenazi Jewish (AJ) Genetics & TAGC
Recent History of Ashkenazi Jews (AJ)
• Mediterranean origin (?)
• Ca. 1000: small communities in Northern France, Rhineland
• Migration east
• Expansion
• Migration to US and Israel
• ≈10M today
• Relative isolation
Ashkenazi Jewish Genetics
Behar et al., Nature, 2010; Bray et al., PNAS, 2010; Guha et al., Genome Biol., 2012
300 Jewish individuals; SNP arrays
• Recently, AJ shown to be genetically distinct
• Close to Middle-Easterners & South-Europeans
Price et al., PLoS Genet., 2008; Olshen et al., BMC Genet., 2008; Need et al., Genome Biol., 2009; Kopelman et al., BMC Genet., 2009
[Figure: AJ, Jewish non-AJ, Middle-Eastern, and European samples (Atzmon et al., AJHG, 2010)]
Recent Demography & IBD
[Figure: individuals A and B with a shared segment]
• Recent, strong genetic drift leads to long identical-by-descent haplotypes.
• IBD sharing common in AJ (Gusev et al., MBE, 2011 and others)
• Inferred bottleneck of just ≈300 individuals ≈800 ya (Palamara et al., AJHG, 2012)
Ashkenazi-Jewish (AJ) Genetic Risk Factors
• Multitude of Mendelian disorders
– Carrier screening: A success story
• Breast and ovarian cancer: BRCA1, BRCA2
• Parkinson’s disease: LRRK2, GBA
[Figure: Tay-Sachs births (Gravel et al., 2001)]
AJ Genetics: Summary & Prospects
✓ Large population (≈10M)
✓ Narrow bottleneck (≈300)
✓ Mostly isolated
✓ Recruitable
✓ Well studied
✓ Insight on both European and Middle-Eastern past
× No genealogies
× Mobile
× Some recent admixture
× Significant ancient admixture
The Ashkenazi Genome Consortium
Phase I:
• 128 AJ personal genomes
• Healthy controls
• Unrelated, PCA-validated AJ
• Technology: Complete Genomics
Goal:
• 11+5 labs, mostly from the NY area
• Sequence to high coverage hundreds of healthy AJ
o Use as a reference panel for imputation and clinical interpretation
o Improve understanding of population history and functional genetic variation in AJ
Basic Variant Statistics
Variant Statistics & Comparison to Europeans
• Comparison panels:
o 1000 Genomes Europeans
o 26 Flemish from Belgium, sequenced by Complete Genomics
Projection method: Gravel et al., PNAS, 2011
Allele Frequency Spectrum
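Comparing allele frequency spectra from panels of different sizes requires bringing them to a common sample size. A minimal sketch of hypergeometric down-projection follows; whether this matches the cited projection method in every detail is an assumption, and the function name `project_sfs` is hypothetical.

```python
from math import comb

def project_sfs(sfs, m):
    """Project an unfolded allele frequency spectrum down to m chromosomes.

    sfs[k] is the number of sites whose derived allele is seen k times
    among n = len(sfs) - 1 sampled chromosomes (k = 0..n).
    Assumption: chromosomes are subsampled uniformly without replacement.
    """
    n = len(sfs) - 1
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= n")
    projected = [0.0] * (m + 1)
    for k, count in enumerate(sfs):
        # j derived alleles can appear in the subsample only within this range
        for j in range(max(0, m - (n - k)), min(k, m) + 1):
            # Hypergeometric probability of j derived alleles among m draws
            p = comb(k, j) * comb(n - k, m - j) / comb(n, m)
            projected[j] += count * p
    return projected

# Toy spectrum for n = 4 chromosomes, projected to m = 2
print(project_sfs([0, 6, 3, 1, 0], 2))
```

The projection preserves the total number of sites, so spectra from the AJ and comparison panels can be compared bin by bin after projecting both to the same m.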
Utility in AJ Medical Genetics
Screening AJ Genomes
An ancestry-matched reference panel is expected to filter more benign variants in clinical genomes.
A Catalog of Mutations in Known AJ Disease Genes
• Tens of genes harbor known mutations for AJ-prevalent Mendelian disorders or risk factors for multifactorial diseases.
o Tay-Sachs disease, Gaucher disease, Familial dysautonomia, Niemann-Pick disease, Torsion dystonia, Canavan disease, Bloom syndrome, etc.
o Breast cancer (BRCA1/2), Colon cancer (APC), Parkinson’s (LRRK2), etc.
• We mapped 73 mutations in 48 genes.
• Detected carriers of 35 known disease mutations.
• Detected 184 missense and 18 loss-of-function novel (dbSNP135) variants.
o Catalog will be made available.
Imputing AJ Arrays
• AJ outperforms CEU even for a larger CEU panel
• Accuracy improved across all frequencies and by all measures
– Discordance rate, r², false negatives/positives, Impute2 metrics
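Two of the accuracy measures named above can be sketched as follows, assuming genotypes are coded as alternate-allele counts (0, 1, 2) and that true genotypes are available for the imputed sites. The function names are hypothetical, and the Impute2-specific metrics are not reproduced.

```python
def discordance_rate(true_gt, imputed_gt):
    """Fraction of genotype calls where the imputed call differs from the truth."""
    if len(true_gt) != len(imputed_gt) or not true_gt:
        raise ValueError("need two non-empty sequences of equal length")
    mismatches = sum(1 for t, i in zip(true_gt, imputed_gt) if t != i)
    return mismatches / len(true_gt)

def r_squared(true_gt, imputed_dosage):
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    n = len(true_gt)
    mean_t = sum(true_gt) / n
    mean_d = sum(imputed_dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(true_gt, imputed_dosage))
    var_t = sum((t - mean_t) ** 2 for t in true_gt)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosage)
    if var_t == 0 or var_d == 0:
        return float("nan")  # undefined for monomorphic sites
    return cov * cov / (var_t * var_d)

truth = [0, 1, 2, 1]
imputed = [0, 1, 1, 1]
print(discordance_rate(truth, imputed), r_squared(truth, imputed))
```

Discordance is dominated by common variants, while r² is more informative at low frequencies, which is one reason to report both across the frequency range.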
Imputation by IBD
• Impute by copying long IBD segments from a fully sequenced genome into a sparsely genotyped one.
– Only 1-2 recent mutations per segment are expected
• IBD detected using Germline with additional filtering.
Fit to: >3 cM
A Short Detour:
A Model for the Expected Coverage
Coverage by IBD: Theory
• Problem statement:
– Reference panel (say, fully sequenced) of size nr
– Study panel (say, sparsely genotyped) of size ns
– Detect all IBD segments of length >m (Morgan) between study and reference panels
– What is the average fraction of a study genome covered by IBD segments to the reference panel?
• Assumptions:
– Haploid (phased), infinite genomes
– All segments can be detected
– Coalescent with recombination
– Recombination breaks a shared segment (B>>1)
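The quantity in the problem statement has a direct empirical analogue: the fraction of one study genome covered by the union of its detected IBD segments. A minimal sketch, assuming segments are given as (start, end) positions in Morgans on a single finite chromosome; the function name `ibd_coverage` and the input layout are hypothetical.

```python
def ibd_coverage(segments, genome_length, min_length=0.0):
    """Fraction of a genome covered by the union of IBD segments.

    segments: iterable of (start, end) positions, same units as genome_length.
    min_length: only segments strictly longer than this cutoff (m) are counted.
    Assumption: a single linear chromosome; overlapping segments count once.
    """
    kept = sorted((s, e) for s, e in segments if e - s > min_length)
    covered = 0.0
    cur_start, cur_end = None, None
    for start, end in kept:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and start a new one
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged interval
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length

# Three segments to the reference panel on a 1 Morgan chromosome
segs = [(0.00, 0.10), (0.05, 0.20), (0.30, 0.40)]
print(ibd_coverage(segs, 1.0), ibd_coverage(segs, 1.0, min_length=0.12))
```

Averaging this value over all study genomes gives an empirical estimate of the coverage that the theory aims to predict.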
[Figure: population-size model. Time axis in generations (present, g, g+1); bottleneck of size B, with N→∞ otherwise; probability 1-α]
Coverage by IBD: Theory
• Exact solution:
– Define and
– Denote the average coverage as
• Limits:
– For (small reference panel, wide bottleneck),
– For ,
– For (short length cutoff, recent bottleneck),
– For ,
• Approximation:
– ,
– Fits very well numerically
• Diploids:
Demographic History of AJ & Europeans
Recent AJ History Using IBD
• Assume a population of historical size diploids
– Time scaled by 2N0
• Fraction of the genome in segments of length : (Palamara et al., AJHG 2012)
• Detect IBD in sample → Infer history
Ancient History, One Population at a Time
• Fit the allele frequency spectrum, computed using diffusion (∂a∂i, Gutenkunst et al., PLoS Genetics, 2009)
A Consequence
• Number of segregating sites Sn(t) – Zivkovic and Stephan, Theor. Pop. Biol. 2011
• n: #diploid samples; θ=4N0μ; μ: mutation rate per generation
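As a constant-size baseline for the notation above, the standard neutral expectation for the number of segregating sites in n diploid samples (2n chromosomes) is θ·Σ 1/i for i = 1..2n-1. The sketch below implements only this baseline; the time-dependent Sn(t) from the cited paper is not reproduced, and the example parameter values are hypothetical.

```python
def expected_segregating_sites(n_diploid, N0, mu, seq_len):
    """Expected number of segregating sites under a constant-size neutral model.

    n_diploid: number of diploid samples (2n chromosomes)
    N0: effective population size; mu: mutation rate per bp per generation
    seq_len: number of base pairs considered
    Assumption: infinite-sites mutation, constant population size N0.
    """
    theta = 4 * N0 * mu * seq_len                 # theta = 4*N0*mu, scaled to the region
    harmonic = sum(1.0 / i for i in range(1, 2 * n_diploid))  # sum over i = 1..2n-1
    return theta * harmonic

# Hypothetical values, for illustration only
print(expected_segregating_sites(n_diploid=2, N0=10_000, mu=1e-8, seq_len=1_000_000))
```

Because the harmonic sum grows only logarithmically with n, departures from this baseline as samples are added carry information about changes in population size.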
Principal Component Analysis
Ancient History
What we know/learned so far:
• AJ are a Middle-Eastern:European mix
• Slightly higher heterozygosity (+2.4%)
– Larger ancient population size
– Admixture
– Recent explosive growth
• Many more AJ-specific variants
– +14% for 25x25 genomes
• Out-of-Africa (Henn et al., PNAS, 2012)
– ≈50-60 kya
– Serial founder model: Africa → Middle-East → Europe
– Hunter-gatherers in Europe at ≈40-45 kya (Higham et al., Nature, 2011)
– Bottleneck and expansion at each step
The Joint AFS
• Allele frequencies correlated but substructure exists.
• Experimenting with inference using the joint AFS
– For our sample size, can infer at most ≈10 parameters
– Hard to infer very recent history
– Hard to infer migration rates
A Proposed Model
[Figure: proposed demographic model for Flemish and AJ, with time running to the present. Parameters: N0, Nb,OOA, Nf,AJ, Tb,OOA, Nb,EU, Nf,EU, Tb,EU, Ta, fa]
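The Inferred Model
[Figure: inferred demographic model for Flemish and AJ; time axis in years ago, up to the present. Values shown: 6500; 2300; 52,000; 1800; 58,000; 10,800; 1700; 55%; 7500. Annotations: Out-of-Africa? Early Neolithic migrants? Jewish diaspora? Middle-East/Levant?]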
European Origins
Farming began in Europe ≈5-8 kya (“the Neolithic revolution”)
Spread of ideas (“cultural diffusion”)
Human migration (“demic diffusion”)
• Under cultural diffusion, expect a split from Middle-Easterners at ≈40-45 kya; under demic diffusion, at ≈5-8 kya.
• We estimate ≈11 kya
• Earlier than ≈5-8 kya perhaps due to
– Early substructure before actual migration
– Incomplete replacement of hunter-gatherers
– Traces of recovery from the Last Glacial Maximum
Confidence Intervals
Parameter   Maximum likelihood   Bias-corrected mean ± SD   95% confidence interval
            6543                 6523 ± 25                  [6475 , 6572]
            2256                 2314 ± 47                  [2223 , 2406]
            53,050               52,007 ± 1561              [48,947 , 55,067]
            7632                 7494 ± 193                 [7116 , 7872]
            1556                 1802 ± 28                  [1748 , 1857]
            10,600               10,835 ± 188               [10,467 , 11,202]
            56,519               57,977 ± 2912              [52,270 , 63,685]
            1940                 1686 ± 98                  [1495 , 1878]
            55%                  55% ± 1%                   [53% , 57%]
• Parametric bootstrap:
o Simulate whole genomes with the maximum likelihood parameters (MaCS, Chen et al., Genome Res., 2009)
o Infer using the simulated datasets
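One common way to turn parametric-bootstrap replicates into a bias-corrected estimate and a confidence interval is sketched below. That the intervals on this slide were computed exactly this way is an assumption, although they are consistent with roughly ±1.96 SD around the bias-corrected mean (for example, 6523 ± 1.96×25). The function name and the replicate values are hypothetical.

```python
from statistics import mean, stdev

def bootstrap_summary(ml_estimate, bootstrap_estimates, z=1.96):
    """Bias-corrected estimate, SD, and normal-approximation confidence interval.

    ml_estimate: parameter value inferred from the real data
    bootstrap_estimates: values re-inferred from datasets simulated under ml_estimate
    Assumption: bias is estimated as mean(bootstrap) - ml_estimate and subtracted.
    """
    bias = mean(bootstrap_estimates) - ml_estimate
    corrected = ml_estimate - bias
    sd = stdev(bootstrap_estimates)
    return corrected, sd, (corrected - z * sd, corrected + z * sd)

# Hypothetical replicates for a parameter whose ML estimate is 100
corrected, sd, ci = bootstrap_summary(100, [102, 104, 106, 108])
print(corrected, round(sd, 2), ci)
```

In the workflow above, each bootstrap estimate would come from re-running the inference on one simulated whole-genome dataset.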
Hmmm…
Mutation rate
Model specification
Mutation Rate
• We used per bp per generation: the “phylogenetic rate”.
• The “de-novo rate” is , and would double all population sizes and times.
• We preferred the phylogenetic rate for a few (weak) reasons
– False negatives may exist in some de-novo studies
– The de-novo rate does not account for selection
– With the de-novo rate, the Out-of-Africa time would be >100 kya
• A decrease of 50% in the mutation rate will bring the split time to ≈16 kya
– Supports the LGM recovery hypothesis
– Identifies the Middle-East as the source of the recovery (Haber et al., PLoS Genetics, 2013; Pala et al., AJHG 2012)
– Still suggests genetic discontinuity from first hunter-gatherers who colonized Europe
• Debate is still open
Model Specification
We tried several alternative models
• All models support >50% European ancestry in AJ and European-Middle-Eastern split 10-15 kya.
• For example, a two-wave model for the population of Europe supports LGM recovery + Neolithic replacement:
Summary & Outlook
• We sequenced 128 healthy AJ genomes to high coverage.
• Our reference panel will improve:
– Screening of AJ clinical genomes or known disease genes
– Imputation of AJ SNP arrays
• IBD sharing indicates a very recent bottleneck and expansion.
• The AJ-European joint allele frequency spectrum suggests:
– Over 50% European ancestry in AJ
– Europeans diverged from Middle-Easterners only ≈10-15 kya
– Made possible by sequencing a population with partly Middle-Eastern ancestry
• In the future:
– Sequence ≈200 more genomes to cover the entire bottleneck
– Use genomes from more populations to fine-tune demographic models
Thank you!
TAGC consortium members:
Columbia University Computer Science: Itsik Pe’er, Fillan Grady, Ethan Kochav, James Xue, Shlomo Hershkop
Long-Island Jewish Medical Center: Todd Lencz, Semanti Mukherjee, Saurav Guha
Columbia University Medical Center: Lorraine Clark, Xinmin Liu
Albert Einstein College of Medicine: Gil Atzmon, Harry Ostrer, Nir Barzilai, Kinnari Upadhyay, Danny Ben-Avraham
Mount Sinai School of Medicine: Inga Peter, Laurie Ozelius
Memorial Sloan Kettering Cancer Center: Ken Offit, Joseph Vijai
Yale School of Medicine: Judy Cho, Ken Hui, Monica Bowen
The Hebrew University of Jerusalem: Ariel Darvasi
Funding: Human Frontiers Science program
VIB, Gent, Belgium: Herwig Van Marck, Stephane Plaisance
Complete Genomics
Omicia
AJ Genetics
[Figure: AJ effective size N over time t, from years ago to the present (Palamara et al., AJHG 2012). Values shown: 2,300; 45,000; 270; 4,300,000; 800 years ago]
[Figure: power of imputation by IBD. x-axis: # of sequenced individuals (0-500); y-axis: % additional information potential (0-100%); series: WTCCC, AJ_SCZ, AJUK]
Complete Genomics WGS
Quality Control
Property                       Genome (exome)
Coverage                       ≈56x
Fraction called                96.7±0.3% (98.1%)
Fraction with coverage > 20x   92.7±1.6% (94.9%)
Concordance with SNP array     99.67±0.25%
Ti/Tv ratio                    2.14±0.004 (3.05)
• 128 samples from two labs were sequenced in 3 batches
• Minimal batch effects
• Some results are for the first batch of 57 genomes
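The Ti/Tv ratio reported in the table is the number of transitions (A↔G, C↔T) divided by the number of transversions among SNVs. A minimal sketch, assuming SNVs are given as (ref, alt) base pairs; the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

def ti_tv_ratio(snvs):
    """Transition/transversion ratio for an iterable of (ref, alt) SNVs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip anything that is not a simple SNV
        # Transition: purine<->purine or pyrimidine<->pyrimidine
        if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed")
    return ti / tv

print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

A genome-wide ratio far from the expected range is a common sign of excess false-positive calls, which is why it serves as a QC statistic here.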
Quality Control
• False positive rate assessment
– Counting (the few) hets inside long runs of homozygosity
– A duplicate sample
• Genome wide extrapolation:
– SNVs: ≈10-40k FP per genome (FDR: 0.3-1.3%)
– Indels: ≈10-30k FP per genome (FDR: 2-6%)
• QC:
– Remove indels and poly-allelic variants
– Remove HWE violations, low call rate
• FP after QC: ≈5k per genome.
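The genome-wide extrapolation can be sketched as follows, assuming every heterozygous call inside a long run of homozygosity is an error and that errors occur at the same per-base rate elsewhere. The function name and the input numbers are hypothetical; only the arithmetic is illustrated.

```python
def extrapolate_false_positives(hets_in_roh, roh_bp, genome_bp, variants_per_genome):
    """Extrapolate false positives (FP) per genome and the false discovery rate (FDR).

    hets_in_roh: heterozygous calls observed inside long runs of homozygosity
    roh_bp: total length of those runs, in base pairs
    genome_bp: length of the called genome, in base pairs
    variants_per_genome: total variant calls per genome
    Assumption: the per-base error rate inside the runs holds genome-wide.
    """
    fp_per_bp = hets_in_roh / roh_bp
    fp_genome = fp_per_bp * genome_bp
    fdr = fp_genome / variants_per_genome
    return fp_genome, fdr

# Hypothetical counts: 100 hets in 30 Mb of runs, 3 Gb genome, 3.4M SNVs
fp, fdr = extrapolate_false_positives(100, 30e6, 3e9, 3.4e6)
print(round(fp), f"{fdr:.2%}")
```

With about 3.4M SNVs per genome, 10k false positives correspond to an FDR of about 0.3%, the lower end of the range quoted above.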
[Figure: hets inside a run of homozygosity (roh)]
Concordance with Arrays
Asymptotic discordance: 0.05%
Processing and Cleaning Pipeline
AJ: 58 Complete Genomics masterVar (hg19)
• CGA tools mkvcf → VCF file
• Local cleaning
– Remove low-quality, half-called, or non-SNVs
– Remove variants not fully called in at least one individual
– Remove inbred individual
• Cohort-based cleaning
– Remove poly-allelic variants
– Remove variants with high no-call rate or that are not in Hardy-Weinberg equilibrium
• Tools: custom script; Plink/Seq
• Output: Plink file
Flemish: 26 Complete Genomics masterVar (hg18)
• CGA tools → testvariants file
• Local cleaning
– Liftover hg18 => hg19
– Remove low-quality, half-called, or non-SNVs
– Remove variants not fully called in at least one individual
• Cohort-based cleaning
• Tools: custom script
• Outputs: VCF file, Plink file
AJ-Flemish merge:
• Initial filtering
– Remove coordinates with reference mapping problem
– Remove variants with AJ-Flemish incompatible alleles
• Merge AJ-Flemish genotypes
– Variant in both cleaned files? Keep
– Variant in one cleaned file and in the VCF of the other? Discard
– Variant in one cleaned file and not at all in the other? Keep and set other as hom-ref
• Remove variants incompatible with 1000 Genomes
• Phase and impute sporadically missing genotypes (SHAPEIT; using 1000 Genomes panel)
• Phase using molecular phasing information (seqphase)
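The three-way AJ-Flemish merge rule in this pipeline (keep, discard, or keep and set the other cohort to hom-ref) can be sketched as a small decision function. The function name, the boolean flags, and the returned labels are hypothetical; the sketch illustrates the logic only and does not read real VCF or Plink files.

```python
def merge_decision(in_clean_aj, in_raw_aj, in_clean_fl, in_raw_fl):
    """Decide what to do with one variant when merging two cohorts.

    in_clean_*: variant is present in that cohort's cleaned file
    in_raw_*:   variant is present in that cohort's pre-cleaning VCF
    Returns "keep", "discard", "keep_other_hom_ref", or "absent".
    """
    if in_clean_aj and in_clean_fl:
        return "keep"                       # passed cleaning in both cohorts
    if in_clean_aj or in_clean_fl:
        other_raw = in_raw_fl if in_clean_aj else in_raw_aj
        if other_raw:
            return "discard"                # called in the other cohort but filtered there
        return "keep_other_hom_ref"         # never seen in the other cohort
    return "absent"                         # not in either cleaned file

print(merge_decision(True, True, False, True))    # discard
print(merge_decision(True, True, False, False))   # keep_other_hom_ref
```

Discarding variants that failed cleaning in one cohort avoids treating a filtered call as a true hom-ref genotype, which would otherwise create spurious cohort-specific variants.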
AJ complete project: 128 Complete Genomics masterVar (hg19)
• CGA tools → testvariants file
• Local cleaning (custom script)
– Remove low-quality, half-called, or non-SNVs
– Remove variants not fully called in at least one individual
• Cohort-based cleaning → Plink file
– Remove poly-allelic variants
– Remove variants with high no-call rate or that are not in Hardy-Weinberg equilibrium
• Phase and impute sporadically missing values (SHAPEIT)
• Validate AJ ancestry
• Validate no cryptic relatedness
• Analyses:
– Summary stats, array concordance, and duplicates analyses
– Ti/Tv statistics
– Monomorphic non-ref and runs-of-homozygosity analyses
Mobile Element Insertions (MEIs) & Copy Number Variants (CNVs)
Initial validation efforts suggested high false discovery rate, at least for novel events.
Novel MEIs:
• 3/11 validated
• Strong batch effect
1000 Genomes MEIs
Variant Statistics
Statistic              Per genome (exome)
Total SNPs             3.4M (22k)
Novel SNPs             3.8% (4.1%)
Het/hom ratio          1.65 (1.67)
Insertions count       220k (242)
Deletions count        235k (223)
Substitutions count    83k (374)
Synonymous SNPs        10,536
Non-synonymous SNPs    9706
Nonsense SNPs          72
Other disrupting       255
CNV count              302
SV count               1480
MEI count              4090
Imputing AJ Arrays
Compare imputation accuracy of AJ SNP arrays when using either AJ or European reference panels.
[Figure: imputation experiment design. Inputs: AJ arrays (1000), phased AJ sequences (57), 1000 Genomes CEU (87). Reference panel 1 (50), reference panel 2 (87), reference panel 3 (137). Study panel: AJ arrays (1007). Steps: reduce to unphased arrays; phase (ShapeIT); impute (Impute2). Outputs: imputed study panels 1, 2, and 3]
Mutation Burden in AJ
• Theoretically, a narrow bottleneck should increase the load of deleterious variants (e.g., Lohmueller, Nature, 2008)
o Or not? (Simons et al., arXiv, 2013)
o Expect higher load in AJ.
• Define deleterious:
o Derived? Minor? Non-reference? Rare?
o How to weight each variant?
o Account for demography, sequencing errors?
o Define significance?
• Compare 26 AJ and 26 Flemish.
• AJ have 1-10% more deleterious variants than expected (using Flemish as baseline). P-values between 0.2 and 10^-60.
Mutation Burden in Disease Categories
• Many diseases have been suggested to be more prevalent in AJ (Goodman 1979)
o Several Mendelian disorders
o Some cancers
o Inflammatory bowel diseases
o Diabetes, obesity
o Some psychiatric diseases, myopia
• Annotate genes according to disease category (Omicia Inc).
• Compare non-synonymous variant load between AJ and Flemish.
Disease category    #genes   AJ/FL ratio
Aging               106      1.07
Infectious          70       1.03
Neonatal            956      1.02
Gastrointestinal    254      1.02
Dental              86       1.01
Immunological       474      1.01
Hemic               202      1.01
Cardiovascular      502      1.01
Endocrinological    750      1.01
Oncological         471      1.01
Women’s             39       1.00
Drug                82       1.00
Neurological        980      1.00
Nutrition           29       0.99
Respiratory         187      0.99
Kidney              285      0.96
Psychiatric         21       0.93
• No category comes out significant in Gene Set Enrichment Analysis.
[Figures: IBD observed, AJ vs. EU; het/hom ratio; time axis t, from years ago to the present]