Computational Systems Biology: Deep Learning in the Life Sciences
6.802 / 6.874 / 20.390 / 20.490 / HST.506
Haoyang Zeng, Lecture 16, April 9, 2019
http://mit6874.github.io
Identifying genetic variants causal for traits and diseases

Today’s lecture
• Fundamentals of heritability
• Genome-wide association study (GWAS)
• Predicting functional variants using machine learning

Part 1 - Fundamentals of heritability

Genotype to Phenotype
• Genotype
  – Complete genome sequence (or an approximation)
  – Can be defined by markers at specific genomic sites that describe differences with a defined reference genome
• A phenotype is defined by one or more traits
• Non-quantitative trait (dead/alive, etc.)
• Quantitative trait
  – Fitness (growth rate, lifespan, etc.)
  – Morphology (height, etc.)
  – Gene expression
• Quantitative Trait Loci (QTL)
  – Genetic marker that is associated with a quantitative trait
  – eQTL – marker associated with gene expression
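
Binary haploid genetic model
[Figure: cross (X) of two haploid parents, each shown with loci 1, 2, ..., N. Present: No, No, No in one parent; Yes, Yes, Yes in the other. The offspring form the F1 generation.]
N is estimated by log2(# F1s tested / # F1s with phenotype)
Example phenotypes
• Alive/dead in a specific environment
• Resistant to a specific virus
Suppose we tested 128 F1s, 16 resistant. What is your estimate of N?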
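
Worked answer, as a minimal sketch. It assumes, as the estimator above implies, that an F1 shows the phenotype only when it inherits all N required loci, each with probability 1/2, so the expected fraction with the phenotype is 2^-N. The function name `estimate_num_loci` is illustrative and not from the lecture.

```python
import math

def estimate_num_loci(f1_tested: int, f1_with_phenotype: int) -> float:
    """Estimate N as log2(# F1s tested / # F1s with phenotype)."""
    if f1_with_phenotype <= 0 or f1_tested < f1_with_phenotype:
        raise ValueError("need 0 < f1_with_phenotype <= f1_tested")
    return math.log2(f1_tested / f1_with_phenotype)

# 128 F1s tested, 16 resistant: 128 / 16 = 8 = 2**3
n_estimate = estimate_num_loci(128, 16)
print(n_estimate)  # 3.0
```

With real counts the ratio is rarely an exact power of two, so the estimate would be rounded to the nearest plausible integer.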
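
Quantitative haploid genetic model
[Figure: cross (X) of two haploid parents, each shown with loci 1, 2, ..., N. Effect size per locus: 0, 0, 0 in one parent; 1/N, 1/N, 1/N in the other.]
Example phenotype: growth rate
Let x be the number of loci (out of N) that an F1 inherits from the parent with effect size 1/N, and let y = x / N be the resulting phenotype:
$$p(x, N) = \binom{N}{x} (1 - 0.5)^{N - x} \, 0.5^{x}$$
$$E[x] = N / 2, \qquad \sigma_x^2 = N / 4$$
$$y = x / N, \qquad E[y] = 1 / 2, \qquad \sigma_y^2 = 1 / (4N)$$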
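
The moments above can be checked directly from the binomial distribution. The sketch below uses exact rational arithmetic, so the comparisons are equalities rather than approximations. The helper names `pmf` and `moments` and the choice N = 10 are illustrative.

```python
from fractions import Fraction
from math import comb

def pmf(x: int, n: int) -> Fraction:
    """p(x, N) = C(N, x) * 0.5**(N - x) * 0.5**x = C(N, x) / 2**N."""
    return Fraction(comb(n, x), 2 ** n)

def moments(n: int):
    """Return (E[x], var[x]) computed by summing over the pmf."""
    mean = sum(x * pmf(x, n) for x in range(n + 1))
    var = sum((x - mean) ** 2 * pmf(x, n) for x in range(n + 1))
    return mean, var

N = 10
mean_x, var_x = moments(N)
mean_y, var_y = mean_x / N, var_x / N ** 2  # y = x / N rescales the moments

print(mean_x, var_x)  # N/2 and N/4
print(mean_y, var_y)  # 1/2 and 1/(4N)
```

The variance of y shrinks as 1/(4N): the more loci contribute, the more tightly the F1 phenotypes cluster around the midpoint 1/2.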

Phenotype is a function of genotype plus an environmental component
• i – individual in [1..N]
• g_i – genotype of individual i
• p_i – quantitative phenotype of individual i (single trait)
• e_i – environmental contribution to p_i
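$$p_i = f(g_i) + e_i$$
$$\sigma_p^2 = \sigma_g^2 + \sigma_e^2 + 2\sigma_{ge}^2$$
Here $\sigma_{ge}^2$ denotes the covariance between the genetic and environmental components.
$$E[e_i] = 0, \qquad E[e^2] = \sigma_e^2$$
Assuming g and e are independent (or making them so by design) yields
$$\sigma_p^2 = \sigma_g^2 + \sigma_e^2$$
The phenotypic variance is estimated from the N individuals as
$$\sigma_p^2 = \frac{1}{N} \sum_{i=1}^{N} (p_i - \mu_p)^2$$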
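
A small simulation can illustrate the decomposition. The sketch below assumes a toy genetic value f(g_i) drawn from {0, 1} and Gaussian environmental noise; both choices, and all variable names, are illustrative rather than part of the lecture. Because g and e are drawn independently, the covariance term comes out close to zero.

```python
import random

def pop_variance(xs):
    """Population variance, matching the 1/N formula on the slide."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def pop_covariance(xs, ys):
    """Population covariance with the same 1/N normalisation."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
n = 20000
f_g = [random.choice((0.0, 1.0)) for _ in range(n)]  # toy genetic values f(g_i)
e = [random.gauss(0.0, 1.0) for _ in range(n)]       # environmental contributions e_i
p = [gi + ei for gi, ei in zip(f_g, e)]              # phenotypes p_i = f(g_i) + e_i

var_p, var_g, var_e = pop_variance(p), pop_variance(f_g), pop_variance(e)
cov_ge = pop_covariance(f_g, e)

print(var_p, var_g + var_e + 2 * cov_ge)  # equal up to floating-point error
print(cov_ge)                             # near zero for independent g and e
```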

Why two heritabilities?
• Broad-sense
  – Describes the upper bound for phenotypic prediction by an optimal arbitrary model
  – Reveals complexity of molecular mechanism
• Narrow-sense
  – Describes the upper bound for phenotypic prediction by a linear model
  – Describes relative resemblance and utility of family disease history
  – Efficient genetic mapping studies

Key caveats
• Heritability is a property of population (segregating allele frequencies) and environment (noise component)
• “Heritability” in practice may refer to either broad- or narrow-sense (or an implicit assumption that they are the same)
• Estimation is difficult (matching environments and avoiding confounding)
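
H² - Broad-sense heritability
• Fraction of phenotypic variance explained by genetic component
• Can estimate $\sigma_e^2$ from identical twins or clones.
$$H^2 = \frac{\sigma_g^2}{\sigma_p^2} = \frac{\sigma_p^2 - \sigma_e^2}{\sigma_p^2}$$
Broad heritability of a trait is the fraction of phenotypic variance explained by genetic causes
$$H^2 = \frac{\sigma_g^2}{\sigma_p^2} = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2} = \frac{2}{3}$$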
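
As a hedged illustration of the clone-based estimate: within a clonal line every individual shares the same genotype, so the spread inside a line can be attributed to the environment. The sketch below pools the within-line variance as an estimate of σ_e² and plugs it into H² = (σ_p² − σ_e²) / σ_p². The three lines and their phenotype values are made up for the example.

```python
def pop_variance(values):
    """Population variance (1/N normalisation)."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

# Hypothetical phenotype measurements for three clonal lines (made-up data).
clone_lines = {
    "line_A": [1.0, 2.0, 1.0, 2.0],
    "line_B": [4.0, 5.0, 4.0, 5.0],
    "line_C": [7.0, 8.0, 7.0, 8.0],
}

all_phenotypes = [v for vals in clone_lines.values() for v in vals]
sigma_p2 = pop_variance(all_phenotypes)

# Pooled within-line variance: genotype is constant within a line,
# so this spread is taken as the environmental variance.
sigma_e2 = sum(pop_variance(vals) * len(vals)
               for vals in clone_lines.values()) / len(all_phenotypes)

H2 = (sigma_p2 - sigma_e2) / sigma_p2
print(sigma_p2, sigma_e2, H2)  # 6.25 0.25 0.96
```

In this toy data almost all variance lies between lines, so H² is high. Real estimates also need matched environments across lines, as the caveats above note.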
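
Additive model of phenotype
$$f_a(g_i) = \sum_{j \in \mathrm{QTL}} \beta_j g_{ij} + \beta_0$$
$$E[f_a(g_i)] = \frac{f_a(p_1)}{2} + \frac{f_a(p_2)}{2}$$
g_ij is marker j for individual i with values {0,1}; p_1 and p_2 denote the genotypes of the two parents of individual i. Quantitative trait loci (QTLs) are discovered for each trait.
Children tend to the midpoint of their parents for additive traits, as they are expected to get an equal number of loci from each parent.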
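
The midpoint property can be verified by enumeration. The sketch below assumes a haploid cross in which the child takes each locus from either parent with probability 1/2, independently across loci. The effect sizes and parent genotypes are arbitrary illustrative values.

```python
from itertools import product

def additive_value(genotype, betas, beta0):
    """f_a(g) = sum_j beta_j * g_j + beta_0."""
    return beta0 + sum(b * g for b, g in zip(betas, genotype))

def expected_child_value(parent1, parent2, betas, beta0):
    """Average f_a over all equally likely children of a haploid cross."""
    total, count = 0.0, 0
    for picks in product((0, 1), repeat=len(betas)):
        # picks[j] == 0 takes locus j from parent1, 1 takes it from parent2
        child = [parent2[j] if pick else parent1[j] for j, pick in enumerate(picks)]
        total += additive_value(child, betas, beta0)
        count += 1
    return total / count

betas, beta0 = [0.5, -0.2, 1.0, 0.3], 2.0      # illustrative effect sizes
parent1, parent2 = [1, 0, 1, 0], [0, 0, 1, 1]  # illustrative genotypes

midparent = (additive_value(parent1, betas, beta0)
             + additive_value(parent2, betas, beta0)) / 2
child_mean = expected_child_value(parent1, parent2, betas, beta0)
print(midparent, child_mean)  # both are 3.4 up to floating-point error
```

The equality follows from linearity of expectation, which is why it holds for the additive model but not in general for models with interactions between loci.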

Historical heritability example
Galton, “Regression towards mediocrity in hereditary stature” (1886)

h² - Narrow-sense heritability
• Fraction of phenotypic variance explained by an additive model of markers
• f_a(g_i) is the additive model of genotypic components in g_i
• Difference between heritability explained by the additive model and the general model is one source of “missing heritability” in current studies
$$p_i = f_a(g_i) + e_i$$
$$\sigma_a^2 = \sigma_p^2 - \frac{1}{N} \sum_{i=1}^{N} \left( p_i - f_a(g_i) \right)^2$$
$$h^2 = \frac{\sigma_a^2}{\sigma_p^2}$$
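
To make the gap between H² and h² concrete, here is a toy calculation under stated assumptions: two independent markers with equally likely genotypes, a genetic function with an interaction term, f(g) = g_1 + g_2 + 2 g_1 g_2, and environmental variance 3/4. All of these are made up for illustration. Because the markers are independent, the best additive fit has β_j = cov(f, g_j) / var(g_j).

```python
from fractions import Fraction as F
from itertools import product

def mean(values):
    return sum(values, F(0)) / len(values)

def variance(values):
    mu = mean(values)
    return mean([(v - mu) ** 2 for v in values])

genotypes = list(product((0, 1), repeat=2))  # four equally likely genotypes
f_vals = [F(g[0] + g[1] + 2 * g[0] * g[1]) for g in genotypes]  # toy f(g) with epistasis
sigma_e2 = F(3, 4)                           # assumed environmental variance

sigma_g2 = variance(f_vals)
sigma_p2 = sigma_g2 + sigma_e2               # g and e independent

# Best additive fit f_a(g) = beta0 + beta1*g1 + beta2*g2 (markers independent).
betas = []
for j in range(2):
    gj = [F(g[j]) for g in genotypes]
    cov = mean([f * x for f, x in zip(f_vals, gj)]) - mean(f_vals) * mean(gj)
    betas.append(cov / variance(gj))
beta0 = mean(f_vals) - sum(b * mean([F(g[j]) for g in genotypes])
                           for j, b in enumerate(betas))
fa_vals = [beta0 + sum(b * g[j] for j, b in enumerate(betas)) for g in genotypes]

# E[(p - f_a)^2] = E[(f - f_a)^2] + sigma_e^2 when e is independent with mean 0.
residual = mean([(f - a) ** 2 for f, a in zip(f_vals, fa_vals)]) + sigma_e2
sigma_a2 = sigma_p2 - residual

H2, h2 = sigma_g2 / sigma_p2, sigma_a2 / sigma_p2
print(H2, h2)  # 3/4 and 2/3
```

The difference H² − h² = 1/12 is the share of phenotypic variance carried by the interaction term, which an additive model cannot capture.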

Example trait heritabilities
[Figure: example trait heritabilities; h² from Visscher et al. 2008]

Part 2 - Genome-wide association study (GWAS)

[Figure: animation of an A/G variant over time, ending with the genotypes observed in healthy controls and disease cases. Animation: Itsik Pe’er, Columbia]

Association between genotype and phenotype
Slide courtesy of David Altshuler, HMS/Broad

Mendelian traits are caused by a single gene

Contingency Tables – χ² test

Allele        | Cases (with AMD) | Controls (without AMD) | Total alleles
C             | a                | b                      | a+b
T             | c                | d                      | c+d
Total alleles | a+c              | b+d                    | a+b+c+d

AMD = Age-related Macular Degeneration
Df = (2 rows - 1) x (2 columns - 1) = 1
$$E_1 = \frac{(a+b)(a+c)}{a+b+c+d}$$
$$X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

SNP rs1061170
1238 individuals with AMD and 934 controls
2172 individuals / 4344 alleles

Allele        | Cases (with AMD) | Controls (without AMD) | Total alleles
C             | 1522 (a)         | 670 (b)                | 2192
T             | 954 (c)          | 1198 (d)               | 2152
Total alleles | 2476             | 1868                   | 4344

$$\chi^2 = \frac{(ad - bc)^2 (a+b+c+d)}{(a+b)(c+d)(b+d)(a+c)}$$
Df = (2 rows - 1) x (2 columns - 1) = 1
X² = 279
P-value = 1.2 x 10^-62
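
The statistic for this table can be reproduced with the closed-form 2x2 expression. The sketch below also computes the p-value for 1 degree of freedom using the identity P(χ²₁ > x) = erfc(sqrt(x / 2)). Function names are illustrative.

```python
import math

def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Closed-form chi-square statistic for a 2x2 contingency table."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (b + d) * (a + c))

def p_value_1df(x2: float) -> float:
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2.0))

# Allele counts for SNP rs1061170 from the table above.
x2 = chi_square_2x2(1522, 670, 954, 1198)
p_value = p_value_1df(x2)
print(round(x2, 1), p_value)  # x2 is about 279.2; p-value is on the order of 1e-62
```

The p-value is very sensitive to the statistic at this scale, so small rounding differences in X² shift the reported mantissa.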

Contingency Tables – Fisher’s Exact Test

Allele        | Cases (with AMD) | Controls (without AMD) | Total alleles
C             | a                | b                      | a+b
T             | c                | d                      | c+d
Total alleles | a+c              | b+d                    | a+b+c+d

$$p = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{a+b+c+d}{a+c}}$$
Sum the probabilities of the observed table and of all more extreme tables with the same marginal totals to obtain the p-value under the null hypothesis

SNP rs1061170
1238 individuals with AMD and 934 controls
2172 individuals / 4344 alleles

Allele        | Cases (with AMD) | Controls (without AMD) | Total alleles
C             | 1522 (a)         | 670 (b)                | 2192
T             | 954 (c)          | 1198 (d)               | 2152
Total alleles | 2476             | 1868                   | 4344

$$p(a,b,c,d) = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{a+b+c+d}{a+c}}$$
$$p\text{-value} = \sum_{i=0}^{670} p(1522+i,\ 670-i,\ 954-i,\ 1198+i)$$
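
The sum above can be evaluated exactly with integer arithmetic. The sketch below implements the one-sided sum as written: it shifts counts toward the observed direction while holding the marginal totals fixed. Function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

def table_probability(a: int, b: int, c: int, d: int) -> Fraction:
    """Hypergeometric probability of one 2x2 table given its margins."""
    return Fraction(comb(a + b, a) * comb(c + d, c), comb(a + b + c + d, a + c))

def fisher_one_sided(a: int, b: int, c: int, d: int) -> Fraction:
    """Sum p(a+i, b-i, c-i, d+i) for i = 0 .. min(b, c)."""
    denom = comb(a + b + c + d, a + c)  # constant: margins do not change with i
    numer = sum(comb(a + b, a + i) * comb(c + d, c - i)
                for i in range(min(b, c) + 1))
    return Fraction(numer, denom)

# Small sanity check that can be verified by hand: (16 + 1) / 70.
small = fisher_one_sided(3, 1, 1, 3)

# Allele counts for SNP rs1061170; min(b, c) = 670 matches the slide's upper limit.
p_amd = fisher_one_sided(1522, 670, 954, 1198)
print(small, float(p_amd))
```

Exact fractions avoid the underflow that would occur if each tiny term were computed in floating point before summing.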

Does the affected or control group exhibit Population Stratification?
• Population stratification is when subpopulations exhibit allelic variation because of ancestry
• Can cause false positives in an association study if there are SNP differences in the case and control population structures
• Control for this artifact by testing control SNPs for general elevation in χ² distribution between cases and controls
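
The slide does not name a specific statistic for "general elevation". One widely used choice, assumed here, is the genomic inflation factor λ: the median χ² over control SNPs divided by the median of a 1-df χ² distribution (about 0.4549). Values well above 1 suggest stratification. The input lists below are made-up numbers.

```python
import statistics

# Approximate median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549

def genomic_inflation(chi2_stats):
    """Genomic inflation factor: observed median chi-square / expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

# Hypothetical chi-square statistics at control SNPs (illustrative values only).
well_behaved = [0.1, 0.4549, 2.0]
inflated = [0.2, 0.9098, 3.0]

lambda_ok = genomic_inflation(well_behaved)
lambda_bad = genomic_inflation(inflated)
print(lambda_ok, lambda_bad)  # about 1.0 and about 2.0
```

In practice λ would be computed over many thousands of SNPs that are not expected to be associated with the trait.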
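
Linkage Disequilibrium (LD) between two loci L1 and L2 in gametes
At locus L1: p_A = probability L1 is A, q_a = probability L1 is a
At locus L2: p_B = probability L2 is B, q_b = probability L2 is b

     | L2 B               | L2 b
L1 A | P_AB = p_A p_B + D | P_Ab = p_A q_b - D
L1 a | P_aB = q_a p_B - D | P_ab = q_a q_b + D

D = measure of linkage disequilibrium; D = 0 when L1 and L2 are in equilibrium
$$D = P_{AB} P_{ab} - P_{Ab} P_{aB}$$
$$r^2 = \frac{D^2}{p_A q_a p_B q_b}$$
Example: when P_AB = P_ab = 0.3 and P_Ab = P_aB = 0.2, D = 0.09 - 0.04 = 0.05, all four allele frequencies equal 0.5, and r² = 0.0025 / 0.0625 = 0.04
r² lies in [0,1]; r is the correlation coefficient between allelic states in L1 and L2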
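
The definitions of D and r² translate directly into code. The sketch below takes the four haplotype frequencies, derives the allele frequencies from them, and evaluates both formulas on the example frequencies. The function name `ld_stats` is illustrative.

```python
def ld_stats(p_AB: float, p_Ab: float, p_aB: float, p_ab: float):
    """Return (D, r^2) from the four haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A, q_a = p_AB + p_Ab, p_aB + p_ab  # allele frequencies at L1
    p_B, q_b = p_AB + p_aB, p_Ab + p_ab  # allele frequencies at L2
    D = p_AB * p_ab - p_Ab * p_aB
    r2 = D ** 2 / (p_A * q_a * p_B * q_b)
    return D, r2

# Haplotype frequencies from the example: P_AB = P_ab = 0.3, P_Ab = P_aB = 0.2.
D, r2 = ld_stats(0.3, 0.2, 0.2, 0.3)
print(D, r2)

# Two loci in equilibrium: every haplotype frequency is a product of allele frequencies.
D_eq, r2_eq = ld_stats(0.25, 0.25, 0.25, 0.25)
print(D_eq, r2_eq)
```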

LD organizes the genome into haplotype blocks
[Figure: pairwise LD in the human genome 5q31 region (associated with Inflammatory Bowel Disease). Red – high r²; white – low r² (ignore lilac)]
[Figure: r² from human chromosome 22]
[Figure: proxy/lead SNPs. Genotypes at an A/G lead SNP shown next to a T/C SNP with r² = 1 and a second A/G SNP with r² = 0.75]

Part 3 - Predicting functional variants using machine learning

Why do we want to interpret the functional consequence of a variant?
• Narrow the pool of candidate variants in GWAS to lower the statistical burden from multi-hypothesis testing
• Identify the causal variant from all that are in strong linkage disequilibrium with the lead variant
• Understand the pathological mechanism of causal variants

Two general approaches
• Annotation-based
• Ab initio from DNA sequence

Annotation-based prediction
• Use functional annotations in the proximal regions as features
• The functional annotations are from functional assays (e.g. ChIP-seq), gene annotation, evolutionary annotation, etc.
• Representative methods
  – CADD (Kircher et al. Nature Genetics 2014)
  – GWAVA (Ritchie et al. Nature Methods 2014)

GWAVA
• Annotations included in the model
  – Open chromatin (DNase I hypersensitivity)
  – Transcription factor binding (ChIP-seq peak calls for 124 TFs)
  – Histone modification (ChIP-seq peak calls for 12 HM)
  – RNA polymerase binding (ChIP-seq peak calls)
  – CpG island
  – Genome segmentation (predicted by Segway and ChromHMM)
  – Conservation (genomic evolutionary rate profiling for mammals)
  – Human variation (mean heterozygosity and mean derived allele frequency)
  – Genic context (distance to the nearest TSS / splice site / gene region)
  – Sequence context (G+C content, is CpG, in repeat sequence)
• Random forest as the computational method
GWAVA can distinguish regulatory mutations from controls

Limitations of annotation-based methods
• The majority of the annotations are specific to the proximal region, not the variant itself
  – False positives arise if a variant resides in an important region, but has no functional consequence
• For a new patient, all the functional assays have to be redone to be able to make accurate predictions

Ab initio prediction from DNA sequence
• Train a computational model that predicts functional signal from DNA sequence
• For a variant, produce the predicted functional signal of the proximal region for both the reference and the alternate allele of the variant
• Train a classifier to predict functional variant from the predicted functional change to the proximal region
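
The second step above, scoring the reference and alternate alleles with the same sequence model, can be sketched as follows. `toy_signal` is a hypothetical stand-in for a trained model (it simply returns CpG dinucleotide density) and is not any method from the lecture. The window, offset, and function names are likewise illustrative.

```python
from typing import Callable

def toy_signal(seq: str) -> float:
    """Hypothetical stand-in for a trained model: CpG dinucleotide density."""
    if len(seq) < 2:
        return 0.0
    cpg = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return cpg / (len(seq) - 1)

def predicted_change(ref_window: str, offset: int, alt_base: str,
                     model: Callable[[str], float]) -> float:
    """Model output for the alternate-allele window minus the reference window."""
    alt_window = ref_window[:offset] + alt_base + ref_window[offset + 1:]
    return model(alt_window) - model(ref_window)

# A C>T variant at offset 2 destroys the only CpG in this short window.
delta = predicted_change("AACGTT", 2, "T", toy_signal)
print(delta)  # -0.2 up to floating-point error: 1 of 5 dinucleotides was CG, now none
```

In the third step, changes like `delta`, computed for each predicted functional signal, would serve as input features to the downstream classifier, which is not shown here.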

Example: CpGenie (Zeng et al. 2017) for DNA methylation prediction
DNA methylation plays a role in:
• Establishment and maintenance of tissue-specific expression profiles
• X-chromosome inactivation
• Genomic imprinting
• Transposable element silencing
• Cell differentiation
• Inflammatory processes

Sequence-based methylation model helps in various analyses of non-coding sequence variants
Convolutional neural network for predicting methylation level of a CpG site
CpGenie accurately predicts the direction of allelic change of DNA methylation
CpGenie outperforms existing methods in classifying meQTLs from non-meQTLs
Predicted change in DNA methylation helps identify causal variants from those in strong LD
CpGenie’s methylation predictions serve as important features for eQTL and GWAS SNP prioritization

Example 2: EnsembleExpr (Zeng et al. 2017) for eQTL prediction
• Dataset
  – Expression level of the reference and alternate allele of 3000+ genetic variants measured by massively parallel reporter assay (MPRA)
• Task
  – Expression prediction
    – Predict the expression level of each DNA sequence example
    – Predict which ones are significant (as defined by a quantile compared to the population)
  – eQTL prediction
    – Predict the expression difference between the ref. and alt. allele of a variant
    – Predict which variants show significant difference in expression between alleles (eQTL)
EnsembleExpr achieved the best performance in the CAGI4 eQTL challenge
EnsembleExpr outperformed existing methods for eQTL prediction

Summary of today’s lecture
• Fundamentals of heritability
• Genome-wide association study (GWAS)
• Predicting functional variants using machine learning