GRC Workshop at AGBT 2015
Tina Graves-Lindsay
CHM1 PacBio Data and Initial Assembly Stats
• 54X Whole Genome Coverage in long reads
• 8.8kb Avg read length
• P5-C3 Chemistry
• PacBio Assembly done by Jason Chin
• Initial assembly had 4.5 Mb N50 contig length
• Alignments of the PacBio CHM1 assembly to CHM1_1.1 and GRCh38 are available
PacBio CHM1 Assembly potentially fills GRCh38 Gaps
[Figure: GRCh38 and PacBio CHM1 tracks. Data exists in the PacBio unitig that is not present in GRCh38.]
Alignment of CHM1 PacBio assembly to CHM1_1.1
[Figure: CHM1_1.1 WGS assembly contigs shown with a PacBio assembly contig.]
BioNano Genome Map confirms assembly of PacBio Contig
[Figure: PacBio assembly contig shown with BioNano Genome Map contigs.]
1q21 patch alignment to chromosome 1
[Figure: 1q21 patch aligned to chromosome 1; labeled regions: 1q32, 1q21, 1p21.]
SRGAP2 Region in PacBio Assembly
[Figure: SRGAP2 region, labeled 1q21.]
CHM1 Falcon vs MHAP Assembly Stats
• MHAP assembly available for download – GCA_000772585.3
Stat                  Falcon Assembly   MHAP
Number of Contigs     5528              3434
N50 Contig Length     5,460,023         4,320,471
Total Assembly Size   2,818,296,359     2,828,300,545
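The N50 contig length reported above is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly size. A minimal sketch of that calculation follows, generalized to any Nx; the function name `nx` and the toy contig lengths are illustrative assumptions, not part of any assembler's output.

```python
def nx(lengths, percent=50):
    """Return the Nx contig length for a list of contig lengths.

    Contigs are taken from longest to shortest until their cumulative
    length reaches `percent` percent of the total; the length of the
    last contig taken is the Nx value. Integer arithmetic avoids
    floating-point rounding at the threshold.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= percent * total:
            return length


# Toy contig lengths (hypothetical, for illustration only).
toy = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(toy)        # 80 + 70 = 150 reaches half of 300
n90 = nx(toy, 90)    # cumulative 270 is reached at the 30-length contig
print(n50, n90)
```

Because N50 depends only on the length distribution, it says nothing about correctness; a misassembly that joins two contigs raises N50.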
CHM1 Assemblies – More on the Way
• MHAP Assembly
• Done by Adam Phillippy
• 1-2 more assemblies will be generated
• Dazzler Assembly
• Gene Myers version
• Longer contig N50 length
• Believe we will be evaluating it, but haven’t seen it yet
• Falcon Assemblies
• Jason Chin generating 1-2 additional Falcon assemblies using
improved software
CHM1 Assembly Assessment Methods
• Assemblies will run through NCBI QA pipeline
• Assessed for contiguity, annotation, and concordance with the
finished BAC paths
• Assembly-to-assembly alignments will be generated between each PacBio
assembly and the Illumina-based CHM1 assembly, as well as GRCh38
• BioNano Genome Map
• SV calls generated from comparing the map data to each of the
CHM1 assemblies
• Alignment of the Illumina reads back to the CHM1
assemblies
• Heterozygous calls are likely indicative of a collapse in the
assembly
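CHM1 derives from a hydatidiform mole and is effectively haploid, so sites where aligned Illumina reads split between two alleles are unexpected and may mark collapsed duplications. The sketch below shows one way such sites could be binned into candidate regions. The input tuple layout, the allele-fraction bounds, the minimum depth, the window size, and the site threshold are all hypothetical choices for illustration, not the method used in the talk.

```python
from collections import Counter


def het_like(ref_reads, alt_reads, low=0.25, high=0.75, min_depth=10):
    """True if the alternate-allele fraction looks heterozygous.

    All thresholds are illustrative assumptions.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return False
    return low <= alt_reads / depth <= high


def candidate_collapses(calls, window=10_000, min_sites=3):
    """Count het-like sites per fixed-size window on each contig.

    `calls` is an iterable of (contig, position, ref_reads, alt_reads).
    Returns sorted (contig, window_start, window_end, site_count) tuples
    for windows holding at least `min_sites` het-like sites.
    """
    counts = Counter()
    for contig, pos, ref_reads, alt_reads in calls:
        if het_like(ref_reads, alt_reads):
            counts[(contig, pos // window)] += 1
    return sorted(
        (contig, idx * window, (idx + 1) * window, n)
        for (contig, idx), n in counts.items()
        if n >= min_sites
    )


# Hypothetical calls for illustration only.
calls = [
    ("ctg1", 1_200, 20, 18),
    ("ctg1", 3_400, 15, 17),
    ("ctg1", 9_900, 22, 20),
    ("ctg1", 25_000, 1, 39),   # nearly all reads carry alt: not het-like
    ("ctg2", 500, 3, 2),       # depth below min_depth: ignored
    ("ctg2", 70_000, 19, 21),  # a lone het-like site: below min_sites
]
flagged = candidate_collapses(calls)
print(flagged)
```

A window flagged this way is only a candidate; elevated read depth over the same window would be a natural second check.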
The Platinum Genome
• What is it?
• Contiguous
• Haplotype-resolved representation of entire genome
• Best assembly from mini-Assemblathon will be picked and improved
• BAC clone paths will be incorporated into PacBio whole genome assembly
• Comparison back to CHM1_1.1 to see if portions of the Illumina assembly will fill in any gaps
• Pick additional BACs to cover regions of the assembly that are still very fragmented
CHM13 – 2nd Platinum Genome
• CHM13 – another hydatidiform mole sample
• PacBio data generated
• 60X data was generated using P5 and P6 Chemistry
• Avg read length ~11kb, longer than CHM1 data
• Data available in SRA
• Generating Illumina coverage to use for assembly QA, SV
detection, and consensus base error correction
• Plan to use BACs to improve the assembly where needed
• Alignment of Assembly to BioNano Genome map
• Currently ~91% of CHM13 assembly aligns to BioNano map
contigs
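A figure such as "~91% of the assembly aligns" can be computed as the number of assembly bases covered by at least one alignment divided by the total assembly length. The sketch below assumes half-open (start, end) alignment intervals per contig and merges overlaps so that no base is counted twice; the function names and toy data are hypothetical, and the talk does not state how its percentage was derived.

```python
def merged_length(intervals):
    """Total bases covered by a list of half-open (start, end) intervals,
    counting overlapping regions once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block, if any, and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def aligned_fraction(contig_lengths, alignments):
    """Fraction of assembly bases covered by at least one alignment.

    `contig_lengths` maps contig name to length; `alignments` maps
    contig name to a list of (start, end) intervals on that contig.
    """
    covered = sum(
        merged_length(alignments.get(name, [])) for name in contig_lengths
    )
    return covered / sum(contig_lengths.values())


# Toy data (hypothetical, for illustration only).
contig_lengths = {"ctgA": 1000, "ctgB": 500}
alignments = {
    "ctgA": [(0, 400), (300, 700), (800, 1000)],
    "ctgB": [(100, 350)],
}
fraction = aligned_fraction(contig_lengths, alignments)
print(f"{fraction:.1%}")
```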
CHM13 Assembled by DNAnexus
• DNAnexus is a cloud-based genome informatics & data
management platform that enables:
• Large scale genomic analysis
• Easy and secure collaboration of data
• Governance and compliance
• Simple deployment of your own code or use of pre-packaged tools
• DNAnexus packaged FALCON so that it can be run without
complicated installation and at scale.
• DNAnexus gives access to massive computational resources
on-demand.
• During assembly of CHM13 FALCON made use of 350
concurrent workers and 1400 concurrent cores.
DNAnexus FALCON Pipeline
CHM13 – 2nd Platinum Genome
Stats                 PacBio          DNAnexus
Number of Contigs     2873            2203
N50                   12,981,785      11,909,487
N90                   2,100,287       1,745,715
N95                   743,427         808,675
Max Contig Length     63,148,543      53,079,926
Total Sequence        2,851,367,788   2,809,672,639
Total Assembly Time   5 days          41 hours
RefSeq Analysis
                                   GRCh38   CHM1_1.1   MHAP CHM1   PacBio CHM1   CHM13
Number of sequences not aligning   21       88         67          67            125
Split Transcripts                  8        35         1245        1131          285
CDS coverage <95%                  17       266        1339        1212          265
Total Sequences Retrieved from Entrez: 49680
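The counts above are easier to compare as percentages of the 49,680 sequences retrieved from Entrez. The sketch below does that arithmetic; the numbers are copied from the slide, while the dictionary layout and key names are assumptions made for illustration.

```python
TOTAL_SEQUENCES = 49680  # total sequences retrieved from Entrez (from the slide)

# Counts per assembly, in the order: not aligning, split transcripts,
# CDS coverage < 95% (values copied from the slide).
counts = {
    "GRCh38": (21, 8, 17),
    "CHM1_1.1": (88, 35, 266),
    "MHAP CHM1": (67, 1245, 1339),
    "PacBio CHM1": (67, 1131, 1212),
    "CHM13": (125, 285, 265),
}

metrics = ("not_aligning", "split_transcripts", "cds_coverage_below_95")

# Convert each count to a percentage of all retrieved sequences.
pct = {
    assembly: {
        metric: 100 * value / TOTAL_SEQUENCES
        for metric, value in zip(metrics, values)
    }
    for assembly, values in counts.items()
}

for assembly, row in pct.items():
    cells = "  ".join(f"{metric}={row[metric]:.3f}%" for metric in metrics)
    print(f"{assembly:12s} {cells}")
```

Expressed this way, split transcripts affect roughly 2.5% of sequences in the MHAP CHM1 assembly versus well under 0.1% in GRCh38.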
Future Directions
• Improve assemblies of both CHM1 and CHM13 to result in a
completely resolved final assembly for each genome
• From both assemblies, add significant structural variants
to the reference as alternate loci
• Sequence additional genomes to add even more diversity
to the reference from more underrepresented populations