Grc workshop agbt2015_tg

GRC Workshop at AGBT 2015

Tina Graves-Lindsay

CHM1 PacBio Data and Initial Assembly Stats

• 54X Whole Genome Coverage in long reads

• 8.8 kb Avg read length

• P5-C3 Chemistry

• PacBio Assembly done by Jason Chin

• Initial assembly had 4.5 Mb N50 contig length

• Have alignments of PacBio CHM1 assembly to CHM1_1.1 and GRCh38
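The coverage and read-length figures above are linked by simple arithmetic: fold coverage is total sequenced bases divided by genome size. The sketch below is only a back-of-the-envelope illustration. The genome size of 3.1 Gb is an assumption (it is not stated on the slides), and the function name is hypothetical.

```python
def reads_needed(coverage: float, genome_size_bp: float, mean_read_len_bp: float) -> float:
    """Approximate number of reads required to reach a target fold coverage.

    coverage = (number of reads * mean read length) / genome size,
    so number of reads = coverage * genome size / mean read length.
    """
    return coverage * genome_size_bp / mean_read_len_bp


# Assumption: a human genome of roughly 3.1 Gb (not given in the slides).
GENOME_SIZE_BP = 3.1e9

# 54X coverage and 8.8 kb average read length are taken from the slide.
n_reads = reads_needed(54, GENOME_SIZE_BP, 8_800)
print(f"~{n_reads / 1e6:.1f} million reads")
```

Under that assumed genome size, 54X coverage at 8.8 kb per read corresponds to roughly 19 million reads.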

PacBio CHM1 Assembly potentially fills GRCh38 Gaps

[Figure: GRCh38 and PacBio CHM1 tracks. Data exists in the PacBio unitig that is not present in GRCh38.]

[Figure: Alignment of CHM1 PacBio assembly to CHM1_1.1. Tracks: CHM1_1.1 WGS Assembly Contigs; PacBio Assembly Contig.]

BioNano Genome Map confirms assembly of PacBio Contig

[Figure: PacBio Assembly Contig aligned to BioNano Genome Map Contigs.]

1q21

1q21 patch alignment to chromosome 1

1q32 1q21 1p21

SRGAP2 Region in PacBio Assembly

1q21

CHM1 Falcon vs MHAP Assembly Stats

• MHAP assembly available for download: GCA_000772585.3

Assembly              Falcon           MHAP
Number of Contigs     5528             3434
N50 Contig Length     5,460,023        4,320,471
Total Assembly Size   2,818,296,359    2,828,300,545
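For readers less familiar with the contiguity statistic in this table: N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the general Nx calculation follows. The function name and the toy contig lengths are illustrative only and are not taken from any of the assemblies.

```python
def nx(lengths, x=50):
    """Return the Nx contig length.

    Nx is the length L such that contigs of length >= L together
    contain at least x percent of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk contigs from longest to shortest until the threshold is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy example (made-up lengths): total is 290, so the N50 threshold is 145.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(nx(toy_contigs))      # N50
print(nx(toy_contigs, 90))  # N90
```

The same function with x=90 or x=95 gives N90 and N95, which are always less than or equal to N50 for the same assembly.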

CHM1 Assemblies – More on the Way

• MHAP Assembly

• Done by Adam Phillippy

• 1-2 more assemblies will be generated

• Dazzler Assembly

• Gene Myers version

• Longer contig N50 length

• Believe we will be evaluating it, but haven’t seen it yet

• Falcon Assemblies

• Jason Chin generating 1-2 additional Falcon assemblies using improved software

CHM1 Assembly Assessment Methods

• Assemblies will run through NCBI QA pipeline

• Assessed for contiguity, annotation, and concordance with the finished BAC paths

• Assembly-to-assembly alignments will be generated between each PB assembly and the Illumina-based CHM1 assembly as well as GRCh38

• BioNano Genome Map

• SV calls generated from comparing the map data to each of the CHM1 assemblies

• Alignment of the Illumina reads back to the CHM1 assemblies

• Heterozygous calls are likely indicative of a collapse in the assembly
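The last check above relies on CHM1 being a hydatidiform mole sample, so true heterozygous sites should be rare and clusters of heterozygous calls suggest that two similar sequences were collapsed into one. A minimal sketch of how such clusters might be flagged is shown below. The window size, the call threshold, and the input format (contig name and position pairs) are all hypothetical choices for illustration, not parameters of the NCBI QA pipeline.

```python
from collections import Counter


def flag_collapse_windows(het_calls, window=10_000, min_calls=5):
    """Flag fixed-size windows that contain many heterozygous calls.

    het_calls: iterable of (contig, position) pairs, 0-based positions.
    Returns sorted (contig, window_start, window_end, n_calls) tuples
    for windows holding at least min_calls heterozygous calls.
    """
    # Count calls per (contig, window index).
    counts = Counter((contig, pos // window) for contig, pos in het_calls)
    return sorted(
        (contig, idx * window, (idx + 1) * window, n)
        for (contig, idx), n in counts.items()
        if n >= min_calls
    )


# Made-up calls: five fall in the first 10 kb of ctg1, the rest are isolated.
calls = [("ctg1", p) for p in (100, 900, 2500, 4000, 9999)]
calls += [("ctg1", 15_000), ("ctg2", 50)]
print(flag_collapse_windows(calls))
```

Flagged windows would be candidates for manual review or for targeted BAC sequencing rather than proof of a collapse.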

The Platinum Genome

• What is it?

• Contiguous

• Haplotype-resolved representation of entire genome

• Best assembly from mini-assemblathon will be picked and improved

• BAC clone paths will be incorporated into PacBio whole genome assembly

• Comparison back to CHM1_1.1 to see if portions of the Illumina assembly will fill in any gaps

• Pick additional BACs to cover regions of the assembly that are still very fragmented

CHM13 – 2nd Platinum Genome

• CHM13 – another hydatidiform mole sample

• PacBio data generated

• 60X data was generated using P5 and P6 Chemistry

• Avg read length ~11 kb, longer than the CHM1 data

• Data available in SRA

• Generating Illumina coverage to use for assembly QA, SV detection, and consensus base error correction

• Plan to use BACs to improve the assembly where needed

• Alignment of Assembly to BioNano Genome map

• Currently ~91% of CHM13 assembly aligns to BioNano map contigs

CHM13 Assembled by DNAnexus

• DNAnexus is a cloud-based genome informatics & data management platform that enables:

• Large scale genomic analysis

• Easy and secure collaboration of data

• Governance and compliance

• Simple deployment of your own code or use of pre-packaged tools

• DNAnexus packaged FALCON so that it can be run without complicated installation and at scale.

• DNAnexus gives access to massive computational resources on-demand.

• During assembly of CHM13, FALCON made use of 350 concurrent workers and 1400 concurrent cores.

DNAnexus FALCON Pipeline

CHM13 – 2nd Platinum Genome

Stats                 PacBio           DNAnexus
Number of Contigs     2873             2203
N50                   12,981,785       11,909,487
N90                   2,100,287        1,745,715
N95                   743,427          808,675
Max Contig Length     63,148,543       53,079,926
Total Sequence        2,851,367,788    2,809,672,639
Total Assembly Time   5 days           41 hours

Refseq Analysis

                                   GRCh38   CHM1_1.1   MHAP CHM1   PacBio CHM1   CHM13
Number of sequences not aligning   21       88         67          67            125
Split Transcripts                  8        35         1245        1131          285
CDS coverage <95%                  17       266        1339        1212          265

Total Sequences Retrieved from Entrez: 49680
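The "CDS coverage <95%" row counts transcripts whose coding sequence is not almost fully represented in an assembly. One simplified way to compute such a coverage fraction from alignment blocks is sketched below. This is an assumption about how the metric could be calculated, not a description of the NCBI pipeline, and the function name and coordinates are hypothetical.

```python
def cds_coverage(cds_length, aligned_blocks):
    """Fraction of a CDS covered by the union of aligned blocks.

    aligned_blocks: (start, end) pairs in CDS coordinates,
    0-based and half-open. Overlapping blocks are counted once.
    """
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(aligned_blocks):
        # Clip to the CDS and skip anything already counted.
        start = max(start, current_end, 0)
        end = min(end, cds_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / cds_length


# Made-up example: two overlapping blocks cover bases 0-900 of a 1000 bp CDS.
fraction = cds_coverage(1000, [(0, 400), (350, 900)])
print(fraction, "flagged" if fraction < 0.95 else "ok")
```

With the 95% cutoff from the table, this toy transcript would be counted in the "CDS coverage <95%" row.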

Future Directions

• Improve assemblies of both CHM1 and CHM13 to result in a completely resolved final assembly for each genome

• From both assemblies, add significant structural variants to the reference as alternate loci

• Sequence additional genomes to add even more diversity to the reference from more underrepresented populations
