Haplotype resolved structural variation assembly with long reads

Mount Sinai: Ali Bashir, Oscar Rodriguez, Matthew Pendleton
Reed College: Anna Ritz, Alex Ledger


Overview

•  Background
   –  Automated Hybrid Assembly
•  Phased Diploid Assembly
   –  Existing limitation of assembly
   –  10X + PacBio
•  Issues with calling SVs and comparing datasets

Technologies of Greater Scale Needed to Study Human Complexity

PacBio: 44X, ~4.9 kb avg. length
BioNano: 80X, 278 kb mean span

Schematic for Hybrid Scaffolding (A. Pang, M. Pendleton)

[Figure labels: NGS; De Novo Assemble; Sequence Contigs; In silico digestion; Contig Maps; BioNano Raw Molecules; Genome Maps; Align Genome Maps to Contigs / Contigs to Genome Maps; Aligned Contig-Scaffold Pairs; Scaffold Graph Construction and Layout; Hybrid Scaffolds]

Schematic for Hybrid Scaffolding: scaffolding is symmetric

Schematic for Hybrid Scaffolding: inconsistencies can be flagged to break or eliminate contigs or genome maps
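The in silico digestion step in the schematic turns each sequence contig into a map of label positions that can be aligned against BioNano genome maps. A minimal sketch of that idea is below. The recognition motifs (GCTCTTC for BspQI, CACGAG for BssSI), the function names, and the toy contig are assumptions made for illustration; this is not the pipeline's own code.

```python
import re

# Assumed recognition motifs for the two nicking enzymes; verify against
# the enzyme vendor's documentation before relying on them.
MOTIFS = {"BspQI": "GCTCTTC", "BssSI": "CACGAG"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def label_positions(contig, motif):
    """Sorted 0-based start positions of the motif on either strand."""
    contig = contig.upper()
    sites = set()
    for pattern in {motif, reverse_complement(motif)}:
        # A lookahead reports overlapping occurrences as well.
        for hit in re.finditer("(?=%s)" % re.escape(pattern), contig):
            sites.add(hit.start())
    return sorted(sites)


def label_intervals(positions):
    """Distances between consecutive labels (the 'contig map')."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy contig (hypothetical): two forward-strand sites and one reverse-strand site.
contig = "AAGCTCTTCAAAAAGAAGAGCTTTTTGCTCTTCAA"
sites = label_positions(contig, MOTIFS["BspQI"])
intervals = label_intervals(sites)
print(sites, intervals)
```

Aligning maps rather than bases is what lets a contig map be compared with a genome map built from much longer molecules; only the spacing between labels has to agree.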

Hybrid Assembly Between BioNano & PacBio Supercontigs

- As anticipated, heterochromatic regions are most difficult to span
- Completes as many as 28 gaps in hg38
- Other methods and data types can be used to further resolve additional gaps in the assembly

Orthogonal Error Profiles enable dramatic improvements

Complex Structural Rearrangements can be Validated Relative to Current References

[Figure labels: hg19 chr1; hg38 chr1; Superscaffold; hg19 gap; Molecule pileup]

Still resolves at least 28 gaps in hg38 assembly for >400 kb in predicted gap intervals

Many large structural variants predicted like the above. Are they real?


Very Large TR Expansions Detected via Optical Map: LPA (Kringle IV Exon Expansion – Each Period >5 kb)

Example Signatures of Complex Events

Most Inversions From the 1000 Genomes Project Are Not Actually Inversions!

Technology Continues to March Forward: GIAB AJ and 1000 Genomes Trios

•  GIAB
   –  Long-read sequencing of trios
      •  20-30X parents
      •  40-70X children
   –  10X Chromium data
   –  Ashkenazi Jewish
•  1kG
   –  Long-read sequencing
      •  ~20-25X parents
      •  ~45-50X children
   –  10X Chromium data
      •  Chinese, Puerto Rican, and Yoruban ancestry

Subread length N50 = 11,087 bp
Total # bases = 220 Gb
# of reads = 27.4 M reads
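N50 is quoted here for subread lengths and again for contigs in the assembly tables. As a reminder of what the statistic measures, the sketch below computes it from a list of lengths using the usual definition: the length L such that sequences of length L or longer contain at least half of all bases. The function name and the example lengths are illustrative only.

```python
def n50(lengths):
    """Length L such that sequences >= L hold at least half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("N50 of an empty collection is undefined")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy lengths (hypothetical): total is 30, so half is 15;
# 10 + 6 = 16 reaches it, giving an N50 of 6.
example = [2, 3, 4, 5, 6, 10]
print(n50(example))
```

Because N50 is weighted by bases, a handful of very long contigs can raise it sharply even when the contig count stays high, which is worth keeping in mind when comparing samples.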

Current Genome Assembly Results on GIAB and 1kG Trios

Sample    Contigs  Average  N50      Max      Total Size
HG002     13231    230 kb   4.1 Mb   31.6 Mb  3.04 Gb
HG003     17873    172 kb   4.6 Mb   21.5 Mb  3.08 Gb
HG004     16487    185 kb   5.3 Mb   22.6 Mb  3.05 Gb
HG00512   23146    117 kb   369 kb   2.6 Mb   2.72 Gb
HG00513   18443    151 kb   401 kb   2.4 Mb   2.78 Gb
HG00514   11517    264 kb   7.2 Mb   61.1 Mb  3.04 Gb
HG00731   20811    132 kb   451 kb   3.8 Mb   2.74 Gb
HG00732   13672    214 kb   1.3 Mb   10.9 Mb  2.93 Gb
HG00733   11143    281 kb   11.4 Mb  57.4 Mb  3.14 Gb
NA19238   56480    39 kb    70 kb    645 kb   2.20 Gb
NA19239   73478    23 kb    40 kb    1.01 Mb  1.71 Gb
NA19240   15245    203 kb   3.8 Mb   20.1 Mb  3.09 Gb

Hybrid Assembly with multiple Enzymes dramatically improves contiguity and coverage of the genome

Sample                Enzyme         Contigs      N50                Total Size
HG00733 (Genome Map)  BspQI          2185         4.2 Mb             5.6 Gb
HG00733 (Genome Map)  BssSI          5038         1.39 Mb            5.1 Gb
HG00733 (Hybrid)      BspQI          133 / 10637  56.4 Mb / 52.2 Mb  2.8 Gb
HG00733 (Hybrid)      BssSI          234 / 10749  40.26 Mb / 29.7 Mb 2.8 / 3.2 Gb
HG00733 (Hybrid)      BspQI + BssSI  104 / 10590  72.6 Mb / 61.5 Mb  2.9 Gb / 3.2 Gb

Joyce Lee (BioNano Genomics)


Previous Work in NA12878: Hybrid Assembly / Variation Analysis Pipeline

Pendleton et al., Nature Methods 2015

M. Pendleton, A. Pang, J. Chin


Summary Of 1kG / GIAB PacBio SV Calls

          Insertion                                Deletion                                  Complex
Sample    # Calls  # TR    # Alu  # L1  # SVA     # Calls  # TR*   # Alu  # L1  # SVA
HG002     13471    5573    325    68    7         9639     6880    798    201   22        2493
HG003*    12947    5133    411    74    5         9692     6776    411    74    5         2580
HG004     12769    5066    475    160   96        9509     7233    971    282   33        2599
HG00512   9830     4164    366    75    67        7672     5781    768    275   23        2157
HG00513   9761     4175    351    86    79        7791     5936    770    258   27        2314
HG00514   1285     4866    212    42    3         9636     6770    767    222   26        2635
HG00731   9874     4322    357    76    75        7678     5797    790    256   17        2174
HG00732   11059    4884    400    85    85        8227     6274    813    271   24        2351
HG00733*  11769    5365    330    45    4         8848     6179    743    191   25        2313
NA19238   7512     2999    280    72    59        6320     4765    628    237   12        1910
NA19239   5909     2357    199    46    50        5061     3809    528    161   21        1468
NA19240*  13285    5185    345    78    7         9791     7596    911    275   23        2600

* Ran through a streamlined version of the SV pipeline; may not be comparable

Other Notes:
- MEI calls use conservative parameters (likely undercalling insertions)
- “Other” calls contain some improperly flagged insertions/deletions as well as complex events and inversions

Venn Diagrams Between all Trios: YRI/PUR/CHS Deletions

Not seeing all proband calls (blue) in the parents
- Under-calling of hets?
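One way to quantify the observation above is to count how many proband calls have a matching call in at least one parent. The sketch below is a hypothetical illustration, not the comparison used to draw the Venn diagrams: the tuple layout, the 100 bp matching window, and the toy calls are all assumptions.

```python
def supported(call, parent_calls, window=100):
    """True if a parent has a call of the same type on the same chromosome nearby."""
    chrom, pos, svtype = call
    return any(
        c == chrom and t == svtype and abs(p - pos) <= window
        for c, p, t in parent_calls
    )


def fraction_in_parents(proband, mother, father, window=100):
    """Fraction of proband calls matched in the mother or the father."""
    if not proband:
        return 0.0
    hits = sum(
        supported(call, mother, window) or supported(call, father, window)
        for call in proband
    )
    return hits / len(proband)


# Toy calls (hypothetical): (chromosome, position, SV type).
proband = [("chr1", 1000, "DEL"), ("chr1", 5000, "DEL"),
           ("chr2", 700, "DEL"), ("chr3", 42, "DEL")]
mother = [("chr1", 1040, "DEL"), ("chr2", 690, "DEL")]
father = [("chr1", 5005, "DEL")]
print(fraction_in_parents(proband, mother, father))
```

A fraction well below 1 is expected if heterozygous events are under-called in the lower-coverage parents, since true de novo SVs should be rare.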

Haplotype Resolved Assembly

Partitioned assembly of PacBio reads with 10X Genomics phased variant calls

Assembly Sequences Often Mix Haplotype Information

http://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/output/generating

[Figure: assembly graph. Bubbles represent divergent alleles. Blue = Maternal, Yellow = Paternal]

- A conservative assembly will NOT try to link across the black bubbles without some sort of scaffolding information
- Assemblies often take a single path in this graph; this could mix maternal and paternal alleles

10X + PacBio: Haplotype Resolved SVs

Overview of Approach

[Figure labels: Genome; 10X partitioned SNVs; Long reads]
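The approach partitions long reads by haplotype using phased SNVs from 10X data, then assembles each partition separately. A minimal sketch of the partitioning step is below, assuming all SNVs belong to a single phase block and that each read has already been reduced to the bases it shows at SNV positions. The function names, the `min_margin` vote threshold, and the toy data are hypothetical.

```python
from collections import Counter


def assign_haplotype(read_alleles, phased_snvs, min_margin=2):
    """Vote on a read's haplotype from the phased SNV alleles it carries.

    read_alleles: {position: base observed in the read}
    phased_snvs:  {position: (hap1_base, hap2_base)}
    """
    votes = Counter()
    for pos, base in read_alleles.items():
        if pos not in phased_snvs:
            continue
        hap1_base, hap2_base = phased_snvs[pos]
        if base == hap1_base:
            votes["hap1"] += 1
        elif base == hap2_base:
            votes["hap2"] += 1
        # Bases matching neither allele are treated as errors and ignored.
    diff = votes["hap1"] - votes["hap2"]
    if diff >= min_margin:
        return "hap1"
    if -diff >= min_margin:
        return "hap2"
    return "unassigned"


def partition_reads(reads, phased_snvs, min_margin=2):
    """Group read names by assigned haplotype."""
    bins = {"hap1": [], "hap2": [], "unassigned": []}
    for name, alleles in reads.items():
        bins[assign_haplotype(alleles, phased_snvs, min_margin)].append(name)
    return bins


# Toy data (hypothetical).
phased = {100: ("A", "G"), 200: ("C", "T"), 300: ("G", "A")}
reads = {
    "r1": {100: "A", 200: "C", 300: "G"},  # consistent with hap1
    "r2": {100: "G", 200: "T"},            # consistent with hap2
    "r3": {100: "A", 200: "T"},            # conflicting votes
    "r4": {500: "A"},                      # covers no phased SNV
}
bins = partition_reads(reads, phased)
print(bins)
```

Requiring a vote margin rather than a simple majority keeps error-prone long reads with conflicting evidence out of both partitions; the unassigned reads can still be used for homozygous regions.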

Revised Call Sets

Hybrid Haplotype Separated Callsets Add Many SVs

* PacBio calls taken from “Sniffles”, developed at the Schatz Lab (Johns Hopkins)
** 10X calls provided using Long Ranger from 10X Genomics

Combined Calls Largely Contain Existing Calls (Sniffles SV Detection)

Calls Remain That Are Unique To A Single Platform (Sniffles SV Detection)

Assembly Calls “Missed” In Haplotype Calls

[Figure labels: Partitioned Reads; Haplotype Contigs; Assembly Calls; Haplotype Calls]

Haplotype Calls “Missed” In Assembly Calls

[Figure labels: Partitioned Reads; Haplotype Contigs; Assembly Calls; Haplotype Calls]

Repeats (like TRs) can shift boundaries for various callers

Example Workarounds

- Breakpoint calls (not assembled sequence): if L1 is within 10% of L2 size and L1 and L2 are both in R, we call it a “match”
- Assembly calls: perform MSA to identify truly “homozygous” sequences

[Figure labels: L1; L2; Tandem Repeat]
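The breakpoint-call rule above can be written down directly. The sketch below reads "within 10% of L2 size" as |L1 - L2| <= 0.10 * L2 and "in R" as the call's reference footprint lying entirely inside the tandem repeat interval R; both readings, along with the `SVCall` type and the toy coordinates, are assumptions rather than the authors' exact definition.

```python
from typing import NamedTuple


class SVCall(NamedTuple):
    start: int  # reference start of the call
    end: int    # reference end (equal to start for an insertion)
    size: int   # SV length in bp (L1 or L2 on the slide)


def inside(call, repeat):
    """True if the call's reference footprint lies within the repeat (start, end)."""
    rep_start, rep_end = repeat
    return rep_start <= call.start and call.end <= rep_end


def is_match(call1, call2, repeat, tolerance=0.10):
    """Two calls match if sizes agree within tolerance and both sit in the same repeat."""
    similar_size = abs(call1.size - call2.size) <= tolerance * call2.size
    return similar_size and inside(call1, repeat) and inside(call2, repeat)


# Toy example (hypothetical): a tandem repeat R and three insertion calls.
R = (1000, 2000)
a = SVCall(1100, 1100, 300)
b = SVCall(1500, 1500, 320)   # shifted breakpoint, similar size
c = SVCall(1500, 1500, 400)   # size differs by more than 10%
print(is_match(a, b, R), is_match(a, c, R))
```

Note that the rule as written is asymmetric, because the tolerance is taken relative to L2; a symmetric variant could divide by the larger of the two sizes instead.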

Complex hets Are Often in “Hard regions”

[Figure labels: Partitioned Reads; Haplotype Contigs; Assembly Calls; Haplotype Calls]

Assembly Provides Detailed Indications Of Quality

•  Provides sequence of breakpoint
•  Potentially provides co-located events
•  Potentially provides information on accuracy of the assembly in that region

Assemblies have additional information

[Figure labels: Ctg 33; Ctg 33 mapped to Chr1; Ctg 120; Mis-assembly point]

Slide from Jason Chin at the SMRT Informatics Workshop

Ongoing / Future Work

•  GIAB is working on how to integrate assembly calls robustly
   –  Surprisingly poor overlap!
•  Typing calls
•  Integrating parental information
•  Providing full length haplotype assemblies for genomes
   –  Can be done with trio phasing
   –  But, it’s now possible with new technologies!
      •  Hi-C
      •  StrandSeq
•  Integrating graphs from 10x and PacBio
•  Pulling in very large SVs with BioNano
•  Moving into full indel resolution and tools for comparing datasets rapidly to alt haps

Acknowledgements

•  Mount Sinai
   –  Eric Schadt
   –  Matt Pendleton
   –  Ajay Ummat
   –  Oscar Franzen
   –  Gintaras Deikus
   –  Robert Sebra
   –  Oscar Rodriguez*
•  Reed College
   –  Anna Ritz
   –  Alex Ledger
•  UCSF
   –  Pui Kwok
•  PacBio
   –  Jason Chin
•  1000 Genomes SV Working Group
•  UW
   –  Mark Chaisson
•  EMBL
   –  Jan Korbel
   –  Markus H.-Y. Fritz
   –  Tobias Rausch
•  BioNano Genomics
   –  Han Cao
   –  Alex Hastie
   –  Heng Dai
   –  Andy Pang
   –  Joyce Lee
•  10X Genomics
   –  Patrick Marks
   –  Deanna Church
   –  Mike Schnall-Levin
   –  Sofia Kyriazopoulou-Panagiotopoulou