http://bioinformaticsinstitute.ru/lectures/840
Sequencing as a tool for studying complex human phenotypes: from genes to whole genomes
Vasily Ramenskiy
UCLA Center for Neurobehavioral Genetics, University of California, Los Angeles, USA
October 2, 2014
Introduction
What is a complex phenotype?
(a) Contribution of genetic factors. Genetic ≠ inherited: de novo mutations
(b) Non-Mendelian inheritance
Abbreviations: SCZ, schizophrenia; ASD, autistic spectrum disorders; BP, bipolar disorder; AD, Alzheimer’s disease; ADHD, attention deficit hyperactivity disorder; TS, Tourette syndrome; OCD, obsessive compulsive disorder; ID, intellectual disability
Endophenotypes: intermediate layer
-- Heritable quantitative traits. Examples: working memory, executive function, sociability, attention, temperament, brain measures
-- Hypothesis: individuals diagnosed with conditions like ASD or SCZ may be at the extreme end of the distribution for some endophenotypes; risk prediction
-- Hope: simpler genetic architectures than clinical diagnoses, easier to dissect
Goals of genetic analysis
Tactical
-- Loci involved (in an individual and in the population)
-- Causal allele spectrum at each locus: rare, common…
-- Locus interaction: common allele as a modifier of rare ones
Strategic
-- Risk prediction
-- Identification of disease pathways → treatment
Methods
Experiment
-- Genome-wide association analysis (GWAS)
-- Sequencing: DNA-Seq, RNA-Seq, ChIP-Seq, …
Data analysis
-- Bioinformatics: variant calling and quality control
-- Bioinformatics: variant annotation and functionality prediction
-- Statistical genetics: single variant or gene level association analysis
Validation
-- Followup genotyping
-- Model organisms, in vitro experiments
Genetic architecture of disease
Sullivan et al. 2012; Mitchell 2014
-- AD, BP, SCZ: allelic spectrum and aetiological role for both rare and common variation
-- ASD, SCZ: variation at hundreds of different genes involved; organized in pathways
-- AD: unexpected cholesterol metabolism and innate immune response pathways
-- ASD: de novo mutations
-- The same SNVs in ASD, SCZ, epilepsy, ADHD, ID and other disorders (Mitchell 2014)
-- SCZ: GWAS points to verified and predicted targets of the non-coding RNA miR-137
1) High risk rare alleles causing Mendelian disease
-- Mostly coding: nonsense, missense, splice site, indels
-- Examples: APP or PS mutations in AD; LRRK2 mutations in Parkinson’s disease
2) Moderate risk low frequency alleles
-- Example: GBA mutations in Parkinson’s disease
-- Most difficult to detect earlier
3) Low risk common alleles
-- Detectable by GWAS
-- Examples: SNCA or MAPT in Parkinson’s disease; CLU, PICALM, CR1 in AD
-- Rarely coding; gene regulation?
4) High risk common alleles
-- Examples: APOE mutations in AD; complement factor H in macular degeneration
-- Easily identifiable by GWAS
-- Late onset diseases
5) De novo mutations
-- Example: autism
-- Diseases which affect reproductive fitness
-- Requires trio sequencing
6) Low risk rare variants
-- Expected to affect gene regulation, splicing etc.
-- Most difficult to identify, require:
   -- large number of cases and controls;
   -- reliable bioinformatic and statistical genetics methods;
   -- functional followup
“Auxiliary” alleles:
7) Alleles in phenotype modifier genes
-- Example: modifier genes in cystic fibrosis
8) Alleles in epistasis with the disease one
-- Example: Bardet-Biedl syndrome
Published GWA at p ≤ 5×10^-8 for 18 trait categories (07/2012)
NHGRI GWA Catalog
www.genome.gov/GWAStudies
www.ebi.ac.uk/fgpt/gwas/
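The p ≤ 5×10^-8 cutoff is usually motivated as a Bonferroni correction of a 0.05 family-wise error rate over roughly one million independent common-variant tests. The sketch below only reproduces that arithmetic; the one-million figure is the usual convention and an assumption here, not a number taken from these slides, and the function names are illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumption: ~1,000,000 effectively independent tests across the genome.
GENOME_WIDE = bonferroni_threshold(0.05, 1_000_000)


def is_genome_wide_significant(p_value: float, threshold: float = GENOME_WIDE) -> bool:
    """True if the association p-value passes the genome-wide threshold."""
    return p_value <= threshold
```

For example, `is_genome_wide_significant(1e-9)` is true, while a p-value of 1e-6 does not pass.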
Stories
I. Allelic Spectrum of Metabolic Syndrome (ASMS) in the Northern Finland Birth Cohort 1966 (NFBC66)
Genetically homogeneous Finnish population
-- Finns descend from a small number of founders 4,000-2,000 years ago
-- Internal migration in the 17th century created small subisolates
-- Subisolates grew rapidly with little further migration
-- Genetically homogeneous sub-populations
Sabatti et al., 2009
NFBC66 and metabolic traits
NFBC66:
-- genetic isolate that is relatively homogeneous in genetic background (extensive LD) and environmental exposures;
-- quantitative traits: no biases characteristic of case-control studies;
-- birth cohort: no age as a potential confounder; longitudinal data;
-- founder population: potential enrichment in damaging variants (not pertinent for GWAS, though);
-- genotypes on ~329K SNPs in 4,763 individuals (out of 12,058 live births)
Nine heritable traits (risk factors for cardiovascular disease or T2D):
-- body mass index (BMI, 1); fasting serum concentrations of lipids: triglycerides (TG), HDL and LDL (2-4); indicators of glucose homeostasis (glucose (GLU) and insulin (INS)) and inflammation (CRP) (5-7); systolic (SBP) and diastolic (DBP) blood pressure (8-9);
-- Extreme values of these traits, in combination, identify a metabolic syndrome, hypothesized to increase risks for both CVD and T2D
-- 31 associations to 6 traits passing a 5×10^-7 threshold after correction, mostly replicating earlier findings;
-- 9 previously unreported associations
-- “Five of these associations—HDL with NR1H3 (LXRA), LDL with AR and FADS1-FADS2, glucose with MTNR1B and insulin with PANK1— implicate genes with known or postulated roles in metabolism”;
-- the currently identified loci, singly and cumulatively, explain little of the trait variability in NFBC1966 (at most ~6% based on multivariate regression);
-- contribution of rare variants?
GWAS results in NFBC66
Sabatti et al., 2009
ASMS: preliminary evidence
GWAS in Finnish population cohorts: known genes and environment explained little of trait variance
Sabatti et al., 2009
ASMS sequencing: overview
-- Samples: 6,121 persons: 4,447 NFBC + 835 FUSION controls + 839 FUSION cases (FUSION: Finland-United States Investigation of NIDDM Genetics)
-- Regions of interest: 78 genes from 17 loci on 10 chromosomes, UTRs + coding, ~270 Kbp
-- Sequencing: pools of barcoded libraries per lane (12 for Illumina GAIIx and 18 for Illumina HiSeq 2000); mean coverage depth 31-285x
-- Data processing: BWA, single-sample BAMs, independent variant calling by three centers (UMich, WashU, UCLA); extensive QC
-- Consensus sites: 2,234; overall concordance rate between centers 99.96%; 1,072 singletons or doubletons; 1,697 with MAF<=0.5%
-- Annotation/prediction: MapSNPs/PolyPhen-2
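The frequency classes quoted for the consensus sites (singletons, doubletons, MAF<=0.5%) can be illustrated with a small sketch. It assumes biallelic sites with genotypes coded as alternate-allele counts (0, 1, 2, or `None` for a missing call); the function names and class labels are hypothetical and are not taken from the ASMS pipeline.

```python
from typing import Optional, Sequence, Tuple


def minor_allele_stats(genotypes: Sequence[Optional[int]]) -> Tuple[int, float]:
    """Return (minor allele count, minor allele frequency) over called genotypes."""
    called = [g for g in genotypes if g is not None]
    n_alleles = 2 * len(called)  # diploid individuals
    if n_alleles == 0:
        raise ValueError("no called genotypes")
    alt_count = sum(called)
    mac = min(alt_count, n_alleles - alt_count)
    return mac, mac / n_alleles


def frequency_class(genotypes: Sequence[Optional[int]], rare_maf: float = 0.005) -> str:
    """Label a site by its minor allele count or frequency."""
    mac, maf = minor_allele_stats(genotypes)
    if mac == 0:
        return "monomorphic"
    if mac == 1:
        return "singleton"
    if mac == 2:
        return "doubleton"
    return "MAF<=0.5%" if maf <= rare_maf else "MAF>0.5%"
```

With about 6,000 sequenced individuals, a singleton corresponds to a MAF near 0.008%, so most sites in the MAF<=0.5% class are far rarer than the cutoff itself suggests.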
Summary of variant allele frequency
Service et al., 2014
Distribution of variant types
Service et al., 2014
Association analysis strategy
Phenotypes:
-- low-density lipoprotein (LDL), high-density lipoprotein (HDL), total cholesterol (TC), triglycerides (TG), fasting glucose (FG), fasting insulin (FI);
-- analyzed as residuals after regression on age, age^2, sex, oral contraceptive use, pregnancy status;
-- T2D cases from FUSION excluded from GLU and INS analyses
Single-variant analysis: variants with MAF>0.1% in additive genetic model; first 5 PCs as covariates; method: PLINK
Gene-level tests: non-synonymous variants with MAF<1% (from 2 to 33 per gene); methods: CMC, SKAT (with direction)
Goal: new single variant signals independent from GWAS or association at the gene level (group tests)
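The idea behind the gene-level tests can be sketched as a collapsing burden comparison: rare variants in a gene are collapsed into one carrier indicator per person, and the phenotype residuals of carriers and non-carriers are compared. This is a deliberately simplified illustration, not the CMC or SKAT implementation used in the study; the function names and toy data layout are assumptions, and no significance test is computed.

```python
from statistics import mean
from typing import List, Sequence


def collapse_rare_variants(
    genotype_matrix: Sequence[Sequence[int]],
    mafs: Sequence[float],
    maf_cutoff: float = 0.01,
) -> List[int]:
    """Return 1 for each individual carrying at least one rare allele, else 0.

    genotype_matrix[i][j] is the alternate-allele count of individual i at variant j;
    mafs[j] is the minor allele frequency of variant j.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_cutoff]
    return [1 if any(row[j] > 0 for j in rare) else 0 for row in genotype_matrix]


def carrier_mean_difference(burden: Sequence[int], residuals: Sequence[float]) -> float:
    """Mean phenotype residual of carriers minus that of non-carriers."""
    carriers = [r for b, r in zip(burden, residuals) if b]
    others = [r for b, r in zip(burden, residuals) if not b]
    if not carriers or not others:
        raise ValueError("need at least one carrier and one non-carrier")
    return mean(carriers) - mean(others)
```

A plain collapsing statistic like this loses power when a gene carries both trait-increasing and trait-decreasing rare variants, as reported for CETP and HDL-C, which is one reason to use tests that take effect direction into account.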
Association results
Initially: 17 loci × 6 metabolic phenotypes => 39 unique locus-phenotype combinations (32 for lipid measures + 6 for GLU + 1 for INS)
Results:
-- For 27 of the 39 locus-phenotype combinations, the re-sequencing analysis essentially recapitulated the results from the GWAS
-- Remaining 12 locus-phenotype associations (7 loci): new signals independent from GWAS
-- ABCA1, gene-level: 23 rare variants implicated in TC and HDL-C
-- CETP, gene-level : 4 and 4 rare NS variants assoc. with increased and decreased HDL-C
-- Protective variant His177Tyr in G6PC2 (lowering FG), FinnMAF=1.4% (vs. 0.23% in Europe);
-- Damaging rs28933094 in LIPC (hepatic lipase deficiency), FinnMAF=1.5%
Service et al., 2014
Why?!
-- Incomplete coverage for some loci
-- Causal non-coding variants?
-- Indels, CNVs etc. (complicated architecture)?
-- Epistatic interactions?
-- Compound heterozygotes?
Lessons from ASMS story
-- Extensive rare variation in the human population
-- GWAS-to-DNA-seq transition: knowing the full coding SNV spectrum may not give immediate answers
II. Tourette syndrome in large pedigrees and independent samples
Harvard Medical School: Jeremiah Scharf, Dongmei Yu; UCLA: Giovanni Coppola, Nelson Freimer, Alden Huang, Jae-Hoon Sul, Renee Sears, Vasily Ramenskiy; U. Chicago: Nancy Cox, Vasa Trubetskoy, Lea Davis
Tourette syndrome (TS)
-- an inherited neuropsychiatric disorder with onset in childhood, characterized by multiple physical (motor) tics and at least one vocal (phonic) tic
-- ~0.4%-3.8% of children ages 5 to 18 may have TS
-- extreme TS in adulthood is a rarity, and TS does not adversely affect intelligence or life expectancy
TS/CT chr 2p linkage region in pedigrees
Dongmei Yu, Jeremiah Scharf
Tourette syndrome: large family sequencing by CIDR (2011)
Samples: 15 pedigrees, 109 samples: 66 affected, 35 not affected, 8 unknown
Exome sequencing: Agilent HumanExon 50Mb Kit, >100K SNVs
Custom targeted sequencing: 5.7 Mbp from chr2 (1-91 Mbp), ~22K SNVs:
-- known and predicted exons not on the Agilent exome kit;
-- additional brain-specific transcripts and AS exons (derived from UCLA fetal and adult brain RNA-sequencing libraries);
-- alternative brain-specific TSS tags using a brain cap-analysis gene expression (CAGE) library;
-- putative promoter regions;
-- predicted splice sites;
-- conserved sequences derived from alignments with 44 vertebrate species
Analysis Plan
Global
-- Single-variant analysis: EMMAX, EIGENSTRAT, PLINK-TDT
-- Gene-based tests: PLINK/SEQ methods, VAAST, Zhu-Xiong method (?)
-- Imputed data
-- CNV analysis
Local
-- Perfect cosegregation
-- Whole dataset
-- Under linkage peaks
-- Regions from literature
-- Multiple-hit analysis
-- Family-based VAAST
-- De novo analysis
Data in Web-Based Database
Manhattan plot of GWAS meta-analysis (Dongmei Yu)
-- Genome-wide significant result in the linkage region
-- Significant SNPs are located in the lncRNA gene
Expression correlation with top hit gene
-- BrainSpan database: expression values for 48,582 genes in 237 experiments, prenatal states only (total: ~53K in 524 exp.); gene should have >0 expression in at least one experiment
-- Pearson correlation coefficient calculated for all gene pairs in prenatal samples
-- List of genes with expression in developing brain correlated with the query gene
-- Compares a gene list against background of ~49K genes
-- Check 1-tail p<0.01 positive correlation: 476 genes
-- Check 1-tail p<0.001 negative correlation: 259 genes
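The co-expression screen described above can be sketched as follows: compute the Pearson correlation between the query gene and every other gene that is expressed in at least one experiment, then rank the genes by correlation. The data layout (a dict mapping gene name to a list of expression values) and the function names are assumptions made for illustration; the sketch returns correlation coefficients only and does not compute the one-tailed p-values used on the slide.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def correlated_genes(
    expression: Dict[str, Sequence[float]], query: str
) -> List[Tuple[str, float]]:
    """Rank all other expressed, non-constant genes by correlation with the query gene."""
    query_values = expression[query]
    results = []
    for gene, values in expression.items():
        if gene == query:
            continue
        # Keep genes with >0 expression in at least one experiment; skip flat profiles.
        if max(values) <= 0 or len(set(values)) == 1:
            continue
        results.append((gene, pearson(query_values, values)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The top of the returned list corresponds to the positively correlated gene set and the bottom to the negatively correlated one; either list can then be passed to a GO enrichment tool against the background of all expressed genes.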
GO terms in 476 genes (positive, p<0.01)
GO terms in 259 genes (neg. corr., p<0.001)
-- “Wnt1 has also been shown to antagonize neural differentiation and is a major factor in self-renewal of neural stem cells. This allows for regeneration of nervous system cells, which is further evidence of a role in promoting neural stem cell proliferation”
Lessons from TS story: what matters?
-- Sample sizes
-- GWAS is not dead
-- Non-coding RNA genes
III. Analysis of WGS variation in the genomic region associated with amygdala volume in bipolar family individuals
UCLA Bipolar project: Nelson Freimer, Susan Service, Scott Fears, Carrie Bearden + many others
Bipolar disorder
-- A severe psychiatric illness, characterized by alternating episodes of depression and mania
-- Ranks among the top ten causes of morbidity and life-long disability worldwide
-- Prevalence: 1-2% of the population
Sequencing and variant calling
1) Initial WGS and variant calling: Illumina
-- 450 individuals from 27 large families (67 trios, 78 married-ins)
2) Genotype recalling at high-quality segregating sites: Samtools
-- 24.6 million variants in 450 individuals
-- Average genotype concordance with genotyped SNPs per individual: 99.78%; Mendelian inconsistency rate in trios: 1.78%
3) Pedigree-based genotype refinement: TrioCaller (Jae-Hoon Sul)
-- 23 million variants in 450 individuals
-- Genotype concordance: 99.86%; Mendelian inconsistencies: 0.18%
4) Imputation on chr6: PLINK, FamLDCaller (Jae-Hoon Sul)
-- 977K variants on chr6 after QC in 839 individuals
-- No singletons, no sites with >=5 discordant genotypes, threshold r^2=0.1
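The Mendelian inconsistency rate quoted for the trios can be illustrated with a minimal check. It assumes biallelic autosomal sites with genotypes coded as alternate-allele counts (0, 1, 2, or `None` for a missing call); the function names are hypothetical and the code is not taken from Samtools or TrioCaller.

```python
from typing import Iterable, Optional, Tuple

Genotype = Optional[int]  # alt-allele count 0, 1 or 2; None if not called


def is_mendelian_consistent(father: int, mother: int, child: int) -> bool:
    """True if the child genotype can arise from one allele of each parent."""
    # Alleles (0 = reference, 1 = alternate) that a parent of each genotype can transmit.
    transmittable = {0: (0,), 1: (0, 1), 2: (1,)}
    return any(
        a + b == child
        for a in transmittable[father]
        for b in transmittable[mother]
    )


def mendelian_inconsistency_rate(
    trios: Iterable[Tuple[Genotype, Genotype, Genotype]]
) -> float:
    """Fraction of fully called (father, mother, child) genotype triples that are inconsistent."""
    checked = inconsistent = 0
    for father, mother, child in trios:
        if None in (father, mother, child):
            continue  # skip sites with a missing call
        checked += 1
        if not is_mendelian_consistent(father, mother, child):
            inconsistent += 1
    if checked == 0:
        raise ValueError("no fully called trios")
    return inconsistent / checked
```

A true de novo mutation also appears as an inconsistency under this check, so a low residual rate after pedigree-based refinement mostly reflects removed genotyping errors, not the absence of de novo events.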
Multisystem component phenotypes of bipolar disorder (Fears et al, 2014)
-- 169 quantitative neurocognitive, temperament-related, and neuroanatomical phenotypes that appear heritable and associated with severe BP, measured in 738 adults (181 affected);
-- About 25% of the phenotypes, including measures from each phenotype domain, were both heritable and associated with BP-I
// Susan Service
Amygdala volume associated region
Region: chr6 144-155 Mbp, table1.022814.bed:

Intergenic        23,353
Noncoding          8,279
Protein-coding    32,090
Total             63,722

Protein-coding (62 genes):

Benign               169
Damaging              86
Exon               1,062
Flank              1,104
Intron            29,473
Nonsense               3
Splice-site            1
Synon                191
-- Burden test with effect direction (over-dispersion)
-- Earlier method SKAT (Sequence Kernel Association Test) modified to work with family samples
-- OK for quantitative phenotypes
Bioinformatics: potential functional variants
Coding variants:
3 nonsense, splice site
2 damaging
1 benign
Non-coding variants (accumulated):
+1 if conserved or accelerated in any available lineage
+0.5 if Active/Strong chromatin in 10 brain tissues
+0.25 if disrupts TF binding site
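The accumulated score for non-coding variants can be written as a small function. The three weights come from the slide; the boolean inputs are assumed to be pre-computed by upstream annotation, and how "in 10 brain tissues" is reduced to a single flag is not specified in the source, so that reduction is left to the caller. The function name is hypothetical.

```python
def noncoding_rank(
    conserved_or_accelerated: bool,
    active_chromatin_in_brain: bool,
    disrupts_tf_binding_site: bool,
) -> float:
    """Accumulate the non-coding variant score from three annotation flags."""
    score = 0.0
    if conserved_or_accelerated:
        score += 1.0   # conserved or accelerated in any available lineage
    if active_chromatin_in_brain:
        score += 0.5   # Active/Strong chromatin state in brain tissues
    if disrupts_tf_binding_site:
        score += 0.25  # disrupts a transcription factor binding site
    return score
```

Because the weights are distinct and halve at each step, every combination of flags gives a distinct total between 0 and 1.75, so a filter such as rank > 0 keeps any variant with at least one line of functional evidence.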
famSKAT results: take 1 (Susan Service)
Variants:
-- Protein-coding genes
-- MAF<10% in married-ins
-- Rank>0
-- Protein coding AND “my rank”>0: ~16K variants
-- MAF<10%

Gene     Pvalue    Nvar  Pos,Mbp
----------------------------------
LATS1    0.001005  138   150.01
RAET1G   0.003231   14   150.24
CNKSR3   0.004171   30   154.73
UST      0.004190  494   149.23
PPIL4    0.005438   97   149.85
Bioinformatics: functional variants for take 2
-- Priority: nonsense >> splice-site >> damaging >> benign >> synonymous >> UTR exon >> flank >> enhancer >> intron
-- New: “Enhancer”, FANTOM5 enhancers associated with a gene
-- New: GWAVA scores (0..1)
-- Components for “my rank”: conservation, TFBS overlap, active chromatin
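When a variant carries several annotations, the priority order above picks the one to report. A minimal sketch of that selection follows; the label strings mirror the slide, while the function name and the choice to ignore unknown labels are assumptions.

```python
from typing import Iterable, Optional

# Most severe first, following the priority order on the slide.
PRIORITY = [
    "nonsense", "splice-site", "damaging", "benign", "synonymous",
    "UTR exon", "flank", "enhancer", "intron",
]
RANK = {label: index for index, label in enumerate(PRIORITY)}


def top_annotation(annotations: Iterable[str]) -> Optional[str]:
    """Return the highest-priority known annotation, or None if none is recognized."""
    known = [label for label in annotations if label in RANK]
    if not known:
        return None
    return min(known, key=RANK.__getitem__)
```

For example, a variant annotated as both `"intron"` and `"damaging"` (intronic in one transcript, missense in another) is reported as `"damaging"`.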
Take 2 + indels vs. Take 2: top 5 genes

Gene     Pvalue    Nvar  Pos,Mbp
----------------------------------
LATS1    0.000074   57   150.01
PPIL4    0.001620   51   149.85
TFL      0.003353    8   149.79
UST      0.004600  367   149.23
SYNE1    0.013834  667   152.66

Gene     Pvalue    Nvar  Pos,Mbp
----------------------------------
LATS1    0.000164   45   150.01
PPIL4    0.000606   46   149.85
TFL      0.003353    8   149.79
UST      0.004549  364   149.23
SYNE1    0.012377  640   152.66
Lessons from BP story: what matters
-- New statistical genetics methods needed
-- Non-coding variants and indels in protein-coding genes?