Computational Systems Biology Deep Learning in the Life Sciences 1 6.802 6.874 20.390 20.490 HST.506 Haoyang Zeng Lecture 16 April 9, 2019 http://mit6874.github.io Identifying genetic variants causal for traits and diseases


Page 1

Computational Systems Biology: Deep Learning in the Life Sciences

6.802 6.874 20.390 20.490 HST.506

Haoyang Zeng, Lecture 16, April 9, 2019

http://mit6874.github.io

Identifying genetic variants causal for traits and diseases

Page 2

Today's lecture

• Fundamentals of heritability
• Genome-wide association study (GWAS)
• Predicting functional variants using machine learning

Page 3

Part 1 - Fundamentals of heritability

Page 4

Genotype to Phenotype

• Genotype
  – Complete genome sequence (or an approximation)
  – Can be defined by markers at specific genomic sites that describe differences with a defined reference genome
• A phenotype is defined by one or more traits
  • Non-quantitative trait (dead/alive, etc.)
  • Quantitative trait
    – Fitness (growth rate, lifespan, etc.)
    – Morphology (height, etc.)
    – Gene expression
• Quantitative trait loci (QTL) – genetic markers associated with a quantitative trait
  – eQTL – marker associated with gene expression

Page 5

Binary haploid genetic model

[Figure: two haploid parents, each with loci 1, 2, ..., N, crossed (X) to give the F1 generation. Causal allele present at each locus: No, No, No in one parent; Yes, Yes, Yes in the other.]

N is estimated by log2(#F1s tested / #F1s with phenotype)

Example phenotypes
• Alive/dead in a specific environment
• Resistant to a specific virus

Suppose we tested 128 F1s, 16 resistant. What is your estimate of N?
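A quick check of the estimate, as a minimal sketch. It assumes (as the figure suggests) that an F1 shows the phenotype only if it inherits the causal allele at all N unlinked loci, each with probability 1/2, so the expected fraction of F1s with the phenotype is 2^-N. The function name is illustrative.

```python
import math

def estimate_num_loci(n_tested: int, n_with_phenotype: int) -> float:
    """Estimate N as log2(#F1s tested / #F1s with phenotype)."""
    if n_with_phenotype <= 0 or n_with_phenotype > n_tested:
        raise ValueError("need 0 < n_with_phenotype <= n_tested")
    return math.log2(n_tested / n_with_phenotype)

print(estimate_num_loci(128, 16))  # 3.0
```

Under that assumption, 16 of 128 is a fraction of 1/8 = 2^-3, which points to about three loci.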

Page 6

Quantitative haploid genetic model

[Figure: two haploid parents, each with loci 1, 2, ..., N, crossed (X). Effect size at each locus: 0, 0, 0 in one parent; 1/N, 1/N, 1/N in the other.]

Example phenotype: growth rate

Page 7

Quantitative haploid genetic model

[Figure: same cross as on Page 6, with effect sizes 0, 0, 0 and 1/N, 1/N, 1/N.]

Example phenotype: growth rate

With $x$ the number of loci an F1 inherits from the parent with effect size 1/N:

$p(x, N) = \binom{N}{x} (1 - 0.5)^{N - x} \, 0.5^{x}$

$E[x] = N / 2$, $\sigma_x^2 = N / 4$

$y = x / N$

$E[y] = 1 / 2$, $\sigma_y^2 = 1 / (4N)$
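The moments of y can be checked numerically. The sketch below sums the binomial distribution exactly rather than simulating; it assumes each locus is inherited independently with probability 0.5, and the helper names are illustrative.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float = 0.5) -> float:
    """p(x, N): probability of inheriting x of the N effect loci."""
    return comb(n, x) * (1 - p) ** (n - x) * p ** x

def phenotype_moments(n: int):
    """Return (E[y], Var[y]) for y = x / n with x ~ Binomial(n, 0.5)."""
    mean_x = sum(x * binomial_pmf(x, n) for x in range(n + 1))
    var_x = sum((x - mean_x) ** 2 * binomial_pmf(x, n) for x in range(n + 1))
    return mean_x / n, var_x / n ** 2

for n in (1, 4, 16, 64):
    mean_y, var_y = phenotype_moments(n)
    # The last column is the closed form 1 / (4N) for comparison.
    print(n, round(mean_y, 6), round(var_y, 6), 1 / (4 * n))
```

The variance of the phenotype shrinks as 1/(4N): the more loci contribute, the more tightly the F1s cluster around the midpoint.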

Page 8

Phenotype is a function of genotype plus an environmental component

• $i$ – individual in [1..N]
• $g_i$ – genotype of individual i
• $p_i$ – quantitative phenotype of individual i (single trait)
• $e_i$ – environmental contribution to $p_i$

Page 9

Phenotype is a function of genotype plus an environmental component

• $i$ – individual in [1..N]
• $g_i$ – genotype of individual i
• $p_i$ – quantitative phenotype of individual i (single trait)
• $e_i$ – environmental contribution to $p_i$

$p_i = f(g_i) + e_i$

$E[e_i] = 0$, $E[e^2] = \sigma_e^2$

$\sigma_p^2 = \sigma_g^2 + \sigma_e^2 + 2\sigma_{ge}^2$

g and e assumed or made independent yields

$\sigma_p^2 = \sigma_g^2 + \sigma_e^2$

$\sigma_p^2 = \frac{1}{N} \sum_{i=1}^{N} (p_i - \mu_p)^2$
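A small simulation illustrates the decomposition. This is a sketch under assumed values: the genetic and environmental terms are drawn as independent Gaussians with variances 2 and 1, which are illustrative and not from the lecture. The covariance computed here plays the role of the cross term written $\sigma_{ge}^2$ above.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance, matching the 1/N definition of sigma_p^2."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
n = 10_000
genetic = [rng.gauss(0.0, 2.0 ** 0.5) for _ in range(n)]  # f(g_i), assumed variance 2
environment = [rng.gauss(0.0, 1.0) for _ in range(n)]     # e_i, assumed variance 1
phenotype = [g + e for g, e in zip(genetic, environment)]

var_p = variance(phenotype)
var_g = variance(genetic)
var_e = variance(environment)
cov_ge = covariance(genetic, environment)
# The identity holds exactly; the cross term is near zero for independent draws.
print(var_p, var_g + var_e + 2 * cov_ge, cov_ge)
```

Because g and e are drawn independently, the cross term is close to zero and the phenotypic variance is approximately the sum of the two components.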

Page 10

Why two heritabilities?

• Broad-sense
  – Describes the upper bound for phenotypic prediction by an optimal arbitrary model
  – Reveals complexity of molecular mechanism
• Narrow-sense
  – Describes the upper bound for phenotypic prediction by a linear model
  – Describes resemblance between relatives and the utility of family disease history
  – Efficient genetic mapping studies

Page 11

Key caveats

• Heritability is a property of the population (segregating allele frequencies) and the environment (noise component)
• "Heritability" in practice may refer to either broad- or narrow-sense (or an implicit assumption that they are the same)
• Estimation is difficult (matching environments and avoiding confounding)

Page 12

$H^2$ - Broad-sense heritability

• Fraction of phenotypic variance explained by genetic component
• Can estimate $\sigma_e^2$ from identical twins or clones.

$H^2 = \frac{\sigma_g^2}{\sigma_p^2} = \frac{\sigma_p^2 - \sigma_e^2}{\sigma_p^2}$
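One way the clone-based estimate could be computed, as a hedged sketch: the environmental variance is taken as the average within-group variance among genetically identical individuals, and the data values are hypothetical. Real twin studies use more careful designs.

```python
from statistics import mean, pvariance

def broad_sense_heritability(population, clone_groups):
    """H^2 = (var_p - var_e) / var_p, with var_e from within-clone variance."""
    var_p = pvariance(population)
    # Differences among genetically identical individuals are attributed
    # to the environment; average the within-group variances.
    var_e = mean(pvariance(group) for group in clone_groups)
    return (var_p - var_e) / var_p

population = [1.0, 3.0, 5.0, 7.0]        # hypothetical trait values, var_p = 5
clone_groups = [[2.0, 4.0], [5.0, 7.0]]  # hypothetical clone pairs, var_e = 1
print(broad_sense_heritability(population, clone_groups))  # 0.8
```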

Page 13

Broad heritability of a trait is the fraction of phenotypic variance explained by genetic causes

$H^2 = \frac{\sigma_g^2}{\sigma_p^2} = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2} = \frac{2}{3}$

Page 14

Additive model of phenotype

$f_a(g_i) = \sum_{j \in QTL} \beta_j g_{ij} + \beta_0$

$g_{ij}$ is marker j for individual i with values {0,1}. Quantitative trait loci (QTLs) are discovered for each trait.

$E[f_a(g_i)] = \frac{f_a(p_1)}{2} + \frac{f_a(p_2)}{2}$

Children tend to the midpoint of their parents ($p_1$ and $p_2$) for additive traits, as they are expected to get an equal number of loci from each parent.
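The midpoint property can be verified by enumeration. The sketch below assumes haploid parents and unlinked loci, so every combination of parental alleles is equally likely in a child; the effect sizes and genotypes are hypothetical.

```python
from itertools import product

def additive_phenotype(genotype, betas, beta0=0.0):
    """f_a(g_i) = sum_j beta_j * g_ij + beta_0 for markers g_ij in {0, 1}."""
    return beta0 + sum(b * g for b, g in zip(betas, genotype))

def expected_child_phenotype(parent1, parent2, betas, beta0=0.0):
    """Average f_a over all children, each locus drawn from either parent."""
    children = list(product(*zip(parent1, parent2)))
    return sum(additive_phenotype(c, betas, beta0) for c in children) / len(children)

betas = [0.5, 1.0, 2.0]                  # hypothetical effect sizes
parent1, parent2 = (1, 0, 1), (0, 0, 1)  # hypothetical haploid genotypes
midpoint = (additive_phenotype(parent1, betas) + additive_phenotype(parent2, betas)) / 2
print(expected_child_phenotype(parent1, parent2, betas), midpoint)  # 2.25 2.25
```

The equality relies on additivity: with interaction terms between loci, the expected child phenotype generally differs from the parental midpoint.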

Page 15

Historical heritability example

Galton, "Regression towards mediocrity in hereditary stature" (1886)

Page 16

$h^2$ - Narrow-sense heritability

• Fraction of phenotypic variance explained by an additive model of markers
• $f_a(g_i)$ is an additive model of the genotypic components in $g_i$
• Difference between heritability explained by additive model and general model is one source of "missing heritability" in current studies

Page 17

$h^2$ - Narrow-sense heritability

• Fraction of phenotypic variance explained by an additive model of markers
• $f_a(g_i)$ is an additive model of the genotypic components in $g_i$
• Difference between heritability explained by additive model and general model is one source of "missing heritability" in current studies

$p_i = f_a(g_i) + e_i$

$\sigma_a^2 = \sigma_p^2 - \frac{1}{N} \sum_{i=1}^{N} (p_i - f_a(g_i))^2$

$h^2 = \frac{\sigma_a^2}{\sigma_p^2}$
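A minimal sketch of the narrow-sense computation. It assumes the additive coefficients are already known; in practice they would be fitted, for example by least squares, which is not shown. The genotypes, effect sizes, and phenotypes are hypothetical.

```python
def narrow_sense_heritability(phenotypes, genotypes, betas, beta0=0.0):
    """h^2 = sigma_a^2 / sigma_p^2, sigma_a^2 = sigma_p^2 - mean squared residual."""
    n = len(phenotypes)
    mu_p = sum(phenotypes) / n
    var_p = sum((p - mu_p) ** 2 for p in phenotypes) / n
    # Additive predictions f_a(g_i) for each individual.
    predictions = [beta0 + sum(b * g for b, g in zip(betas, gi)) for gi in genotypes]
    residual = sum((p - f) ** 2 for p, f in zip(phenotypes, predictions)) / n
    return (var_p - residual) / var_p

genotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]  # hypothetical two-marker genotypes
betas = [1.0, 2.0]                            # hypothetical effect sizes
phenotypes = [0.5, 1.5, 0.5, 3.5]             # additive value plus noise of +/- 0.5
print(narrow_sense_heritability(phenotypes, genotypes, betas))
```

Here the phenotypic variance is 1.5 and the mean squared residual is 0.25, so the additive model explains 1.25 / 1.5, about 0.83, of the variance.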

Page 18

Example trait heritabilities

$h^2$ from Visscher et al. 2008

Page 19

Part 2 - Genome-wide association study (GWAS)

Page 20

[Animation: alleles A and G over time]

Animation: Itsik Pe'er, Columbia

Page 21

[Figure: alleles A and G in healthy controls and disease cases]

Page 22

[Figure: genotypes at an A/G variant in healthy controls and disease cases]

Association between genotype and phenotype

Page 23

Mendelian traits are caused by a single gene

Slide courtesy of David Altshuler, HMS/Broad

Page 24

Contingency tables – χ² test

Allele          Cases (with AMD)   Controls (without AMD)   Total alleles
C               a                  b                        a+b
T               c                  d                        c+d
Total alleles   a+c                b+d                      a+b+c+d

AMD = Age-related Macular Degeneration

Df = (2 rows - 1) x (2 columns - 1) = 1

$E_1 = \frac{(a+b)(a+c)}{a+b+c+d}$

$X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$

Page 25

SNP rs1061170: 1238 individuals with AMD and 934 controls; 2172 individuals / 4344 alleles

Allele          Cases (with AMD)   Controls (without AMD)   Total alleles
C               1522 (a)           670 (b)                  2192
T               954 (c)            1198 (d)                 2152
Total alleles   2476               1868                     4344

$\chi^2 = \frac{(ad - bc)^2 (a+b+c+d)}{(a+b)(c+d)(b+d)(a+c)}$

Df = (2 rows - 1) x (2 columns - 1) = 1

$X^2 = 279$

P-value = 1.2 x 10^-62
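The statistic for the AMD table can be reproduced with the closed-form 2x2 expression. As a sketch, the p-value uses the fact that a chi-square variable with one degree of freedom is a squared standard normal, so its upper tail equals erfc(sqrt(x/2)); the function names are illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table, no continuity correction."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (b + d) * (a + c))

def chi_square_p_value_df1(x2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2))

x2 = chi_square_2x2(1522, 670, 954, 1198)
# Rounds to 279, with a p-value on the order of 1e-62.
print(round(x2), chi_square_p_value_df1(x2))
```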

Page 26

Contingency tables – Fisher's exact test

Allele          Cases (with AMD)   Controls (without AMD)   Total alleles
C               a                  b                        a+b
T               c                  d                        c+d
Total alleles   a+c                b+d                      a+b+c+d

$p = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{a+b+c+d}{a+c}}$

Sum the probabilities of the observed table and of all more extreme tables with the same marginal totals to compute the p-value under the null hypothesis.

Page 27

SNP rs1061170: 1238 individuals with AMD and 934 controls; 2172 individuals / 4344 alleles

Allele          Cases (with AMD)   Controls (without AMD)   Total alleles
C               1522 (a)           670 (b)                  2192
T               954 (c)            1198 (d)                 2152
Total alleles   2476               1868                     4344

$p(a, b, c, d) = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{a+b+c+d}{a+c}}$

$p\text{-value} = \sum_{i=0}^{670} p(1522 + i, 670 - i, 954 - i, 1198 + i)$
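The sum above can be evaluated exactly with integer arithmetic. This sketch computes the one-sided test in the direction shown (increasing a with the margins fixed); the function names are illustrative, and a library routine would normally be used instead.

```python
from fractions import Fraction
from math import comb

def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    return Fraction(comb(a + b, a) * comb(c + d, c), comb(a + b + c + d, a + c))

def fisher_one_sided(a, b, c, d):
    """Sum the observed table and all tables more extreme in the same direction."""
    return sum(table_probability(a + i, b - i, c - i, d + i)
               for i in range(min(b, c) + 1))

print(fisher_one_sided(3, 1, 1, 3))                   # 17/70
print(float(fisher_one_sided(1522, 670, 954, 1198)))  # AMD table from the slide
```

Exact fractions avoid the underflow that floating-point binomial coefficients would hit at this table size.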

Page 28

Does the affected or control group exhibit population stratification?

• Population stratification is when subpopulations exhibit allelic variation because of ancestry
• Can cause false positives in an association study if there are SNP differences in the case and control population structures
• Control for this artifact by testing control SNPs for general elevation in the χ² distribution between cases and controls
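One common way to quantify such a general elevation is the genomic inflation factor, the ratio of the observed median χ² statistic to its expected median under the null. The slide does not name this statistic, so treat the sketch below as an assumption about how the check might be done; the input values are hypothetical.

```python
from statistics import median

CHI2_DF1_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def genomic_inflation_factor(control_snp_chi2):
    """Ratio of the observed median chi-square to its expected value under the null."""
    return median(control_snp_chi2) / CHI2_DF1_MEDIAN

# Hypothetical chi-square statistics from presumed-neutral control SNPs.
control_stats = [0.02, 0.11, 0.35, 0.46, 0.60, 1.30, 2.90]
print(genomic_inflation_factor(control_stats))
```

A value near 1 suggests little inflation; values well above 1 suggest that stratification or another confounder is shifting the whole distribution.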

Page 29

Linkage disequilibrium (LD) between two loci L1 and L2 in gametes

At locus L1: $p_A$ = probability L1 is A, $q_a$ = probability L1 is a
At locus L2: $p_B$ = probability L2 is B, $q_b$ = probability L2 is b

          L2 = B                     L2 = b
L1 = A    $P_{AB} = p_A p_B + D$     $P_{Ab} = p_A q_b - D$
L1 = a    $P_{aB} = q_a p_B - D$     $P_{ab} = q_a q_b + D$

D = measure of linkage disequilibrium; D = 0 when L1 and L2 are in equilibrium

$D = P_{AB} P_{ab} - P_{Ab} P_{aB}$

$r^2 = D^2 / (p_A q_a p_B q_b)$

Example: $r^2$ = 0.04 when $P_{AB}$ = $P_{ab}$ = 0.3 and $P_{Ab}$ = $P_{aB}$ = 0.2 (D = 0.05, all allele frequencies 0.5)

$r^2$ is in [0,1]; r is the correlation coefficient between allelic states in L1 and L2
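A small sketch that evaluates D and $r^2$ directly from the four haplotype frequencies, using the frequencies quoted in the example (0.3, 0.2, 0.2, 0.3). With those inputs the formulas give D = 0.05 and $r^2$ = 0.04. The function name is illustrative.

```python
def linkage_disequilibrium(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, r^2) from the four haplotype frequencies."""
    # Allele frequencies are the marginals of the haplotype table.
    p_A, p_B = p_AB + p_Ab, p_AB + p_aB
    q_a, q_b = 1.0 - p_A, 1.0 - p_B
    d = p_AB * p_ab - p_Ab * p_aB
    return d, d ** 2 / (p_A * q_a * p_B * q_b)

d, r2 = linkage_disequilibrium(0.3, 0.2, 0.2, 0.3)
print(round(d, 4), round(r2, 4))  # 0.05 0.04
```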

Page 30

LD organizes the genome into haplotype blocks

[Figure: pairwise $r^2$ across the human genome 5q31 region (associated with Inflammatory Bowel Disease). Red – high $r^2$; white – low $r^2$ (ignore lilac).]

Page 31

[Figure: $r^2$ from human chromosome 22]

Page 32

[Figure: genotypes at the lead SNP (A/G) shown next to two nearby SNPs: a proxy SNP (T/C) with $r^2$ = 1 and a SNP (G/A) with $r^2$ = 0.75. Label: Proxy/Lead SNPs.]

Page 33

Part 3 - Predicting functional variants using machine learning

Page 34

Why do we want to interpret the functional consequence of a variant?

• Narrow the pool of candidate variants in GWAS to lower the statistical burden from multiple-hypothesis testing
• Identify the causal variant from all that are in strong linkage disequilibrium with the lead variant
• Understand the pathological mechanism of causal variants

Page 35

Two general approaches

• Annotation-based
• Ab initio from DNA sequence

Page 36

Annotation-based prediction

• Use functional annotations in the proximal regions as features
• The functional annotations are from functional assays (e.g., ChIP-seq), gene annotation, evolutionary annotation, etc.
• Representative methods
  • CADD (Kircher et al. Nature Genetics 2014)
  • GWAVA (Ritchie et al. Nature Methods 2014)

Page 37

GWAVA

• Annotations included in the model
  • Open chromatin (DNase I hypersensitivity)
  • Transcription factor binding (ChIP-seq peak calls for 124 TFs)
  • Histone modification (ChIP-seq peak calls for 12 histone modifications)
  • RNA polymerase binding (ChIP-seq peak calls)
  • CpG island
  • Genome segmentation (predicted by Segway and ChromHMM)
  • Conservation (genomic evolutionary rate profiling for mammals)
  • Human variation (mean heterozygosity and mean derived allele frequency)
  • Genic context (distance to the nearest TSS / splice site / gene region)
  • Sequence context (G+C content, is CpG, in repeat sequence)

Page 38

GWAVA

• Random forest as the computational method

Page 39

GWAVA can classify regulatory mutations from controls

Page 40

Limitations of annotation-based methods

• The majority of the annotations are specific to the proximal region, not the variant itself
  • False positives arise if a variant resides in an important region but has no functional consequence
• For a new patient, all the functional assays have to be redone to be able to make accurate predictions

Page 41

Ab initio prediction from DNA sequence

• Train a computational model that predicts functional signal from DNA sequence
• For a variant, produce the predicted functional signal of the proximal region for both the reference and the alternate allele of the variant
• Train a classifier to predict functional variants from the predicted functional change to the proximal region
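The second step, scoring both alleles and taking the difference, can be sketched as follows. The `predict_signal` function is a toy stand-in for a trained sequence model (it simply measures CpG density) and is purely an assumption for illustration; the names and the example sequence are hypothetical.

```python
def predict_signal(sequence: str) -> float:
    """Toy stand-in for a trained model: fraction of positions that start a CpG."""
    sequence = sequence.upper()
    if len(sequence) < 2:
        return 0.0
    return sequence.count("CG") / (len(sequence) - 1)

def variant_effect(ref_window: str, offset: int, alt_base: str) -> float:
    """Predicted signal of the alternate-allele window minus the reference window."""
    alt_window = ref_window[:offset] + alt_base + ref_window[offset + 1:]
    return predict_signal(alt_window) - predict_signal(ref_window)

ref = "ATCGTACGTA"  # hypothetical reference window around the variant
# G->A at offset 3 destroys one of the two CpGs, so the effect is negative.
print(variant_effect(ref, 3, "A"))
```

In the third step, predicted changes like this one would be the input features of a downstream classifier, which is not shown here.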

Page 42

Example: CpGenie (Zeng et al. 2017) for DNA methylation prediction

• Establishment and maintenance of tissue-specific expression profiles
• X-chromosome inactivation
• Genomic imprinting
• Transposable element silencing
• Cell differentiation
• Inflammatory processes

Page 43

Sequence-based methylation model helps in various analyses of non-coding sequence variants

Page 44

Convolutional neural network for predicting methylation level of a CpG site

Page 45

CpGenie accurately predicts the direction of allelic change in DNA methylation

Page 46

CpGenie outperforms existing methods in classifying meQTLs from non-meQTLs

Page 47

Predicted change in DNA methylation helps identify causal variants from those in strong LD

Page 48

CpGenie's methylation predictions serve as important features for eQTL and GWAS SNP prioritization

Page 49

Example 2: EnsembleExpr (Zeng et al. 2017) for eQTL prediction

• Dataset
  • Expression level of the reference and alternate allele of 3000+ genetic variants measured by massively parallel reporter assay (MPRA)
• Task
  • Expression prediction
    • Predict the expression level of each DNA sequence example
    • Predict which ones are significant (as defined by a quantile compared to the population)
  • eQTL prediction
    • Predict the expression difference between the ref. and alt. allele of a variant
    • Predict which variants show significant difference in expression between alleles (eQTL)

Page 50

Example 2: EnsembleExpr (Zeng et al. 2017) for eQTL prediction

Page 51

EnsembleExpr achieved the best performance in the CAGI4 eQTL challenge

Page 52

EnsembleExpr outperformed existing methods for eQTL prediction

Page 53

Summary of today's lecture

• Fundamentals of heritability
• Genome-wide association study (GWAS)
• Predicting functional variants using machine learning