The importance of high quality reference genome assemblies to personal and medical genomics Karyn Meltz Steinberg Genome Informatics 2015 @KMS_Meltzy


Last year…

[Figure 1 (Steinberg et al, 2014): Contig Number and Contig N50 for CHM1_1.1, HuRef, ALLPATHS and YH_2.0; linear axis, 0 to 400,000]

This year…

[Figure: Contig Number and Contig N50 for CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0; linear axis, 0 to 30,000,000]

This year… (log scale)

[Figure: Contig Number and Contig N50 for CHM13 Draft, CHM1 PB_2, CHM1 PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0; log axis, 1 to 100,000,000]

We combine PacBio with other technologies to construct the assembly

How do we define platinum and gold standards?

Metric                                          GRCh38      Platinum (CHM1)   Gold (NA19240)
% Reference genome covered                      100         98.40             90.80
% Assigned chromosomes                          99.60       98.40             90.80
% gene models covered (>95% id, >90% length)    99.96       98.78             94.26
Contig N50                                      67.8 Mb     26.9 Mb           6.0 Mb
Number of gaps                                  875         3,640             3,568
Total Assembled size                            3.067 Gb    2.996 Gb          2.745 Gb
% haplotype blocks (>1kb) resolved              NA          >95               >80

http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
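The gene-model row of the table applies two thresholds at once (>95% identity and >90% of the model's length). A minimal sketch of how such a percentage could be tallied is shown here. The function names and the toy records are hypothetical illustrations, not the pipeline behind the slide.

```python
# Hypothetical sketch: count gene models that pass both thresholds
# quoted in the table (>95% identity, >90% of length aligned).

def is_covered(identity_pct, length_pct, min_identity=95.0, min_length=90.0):
    """True when a gene model exceeds both thresholds (strictly greater)."""
    return identity_pct > min_identity and length_pct > min_length


def percent_covered(models):
    """Percentage of (identity_pct, length_pct) records that are covered."""
    if not models:
        raise ValueError("no gene models supplied")
    covered = sum(1 for identity, length in models if is_covered(identity, length))
    return 100.0 * covered / len(models)


# Toy records (made up for illustration): (identity %, aligned length %)
toy_models = [(99.8, 100.0), (96.1, 92.5), (94.9, 99.0), (99.0, 85.0)]
print(percent_covered(toy_models))  # two of four pass both thresholds
```

A model that is highly identical but truncated (the last toy record) fails just as a full-length but divergent one does, which is why both thresholds are checked together.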

CHM13 Draft Assembly (GCA_000983455.1)

•  60X PacBio (P5 and P6 chemistry)
•  Average read length ~11kb
•  Daligner/Falcon v 0.2

Total sequence length 2,851,367,788

Number of contigs 2,873

Contig N50 12,981,785

Contig L50 68
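For readers less familiar with the two contiguity statistics listed above, here is a minimal sketch of the usual definitions. N50 is the length of the contig at which the running total of lengths, sorted longest first, reaches half of the assembly size. L50 is the number of contigs needed to get there. The function name and the toy lengths are illustrative assumptions; the real CHM13 contig lengths are not reproduced here.

```python
# Minimal sketch of contig N50 / L50, using toy contig lengths.

def n50_l50(lengths):
    """Return (N50, L50) for an iterable of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths supplied")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50; 'count' contigs were needed, so it is the L50.
            return length, count


toy_contigs = [80, 70, 50, 40, 30, 20, 10]  # total 300, half is 150
print(n50_l50(toy_contigs))  # (70, 2)
```

Read this way, the reported values mean that the 68 longest CHM13 draft contigs, each at least 12,981,785 bp long, together hold half of the assembled sequence.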

CHM13 Hybrid Scaffold

[Figure: Hybrid Scaffold built from PacBio Contigs and BioNano Contigs]

CHM13 Hybrid Scaffolds Improve Contiguity

                        BioNano Map   PacBio Assembly   Hybrid Scaffold
# of Contigs            3593          1590*             254
Min Contig Length       0.08 Mb       0                 0.27 Mb
Median Contig Length    0.61 Mb       0.06 Mb           4.35 Mb
Mean Contig Length      0.78 Mb       1.78 Mb           9.68 Mb
Contig N50              1.02 Mb       12.98 Mb          20.79 Mb
Max Contig Length       5.27 Mb       63.15 Mb          82.83 Mb
Total Contig Length     2812 Mb       2824 Mb           2457.75 Mb

*Number of contigs used in hybrid scaffolding
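As a quick consistency check on the table above, the mean contig length in each column should equal the total contig length divided by the number of contigs. The sketch below recomputes it from the slide's own figures. The dictionary layout and function name are illustrative choices.

```python
# Recompute mean contig length (Mb) from the counts and totals in the table.

table = {
    # column: (number of contigs, total length in Mb, reported mean in Mb)
    "BioNano Map": (3593, 2812.0, 0.78),
    "PacBio Assembly": (1590, 2824.0, 1.78),
    "Hybrid Scaffold": (254, 2457.75, 9.68),
}


def mean_length_mb(n_contigs, total_mb):
    """Mean contig length in Mb."""
    return total_mb / n_contigs


for column, (n_contigs, total_mb, reported) in table.items():
    computed = mean_length_mb(n_contigs, total_mb)
    print(f"{column}: computed {computed:.2f} Mb, reported {reported:.2f} Mb")
```

All three columns agree to two decimal places. For the PacBio column this holds with the 1590 contigs used in hybrid scaffolding rather than the 2,873 contigs of the full draft assembly.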

BioNano can be used to size gaps and identify structural variants

[Figure: BioNano Map aligned to the PacBio Assembly; panels: Collapse, Expansion in Assembly, Gap in Sequence]

SV type       Count
Deletions     41
Inversions    10
Insertions    15
Total         66

BioNano alignment to CHM13

BioNano reveals collapse in PacBio assembly

[Figure: BioNano Map aligned to the PacBio Assembly]

Illumina data aligned to PacBio assembly also shows collapse

BioNano reveals collapse in PacBio assembly due to highly homologous segmental duplications

SD = 96%

CHR1   46746040   46857004   40   W   LBHZ01000938.1   110965
CHR1   46857005   47034202   41   N   177198           gap
CHR1   47034203   52157695   42   W   LBHZ01000245.1   5123493

[Figure: BioNano Map aligned to the PacBio Assembly]
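The three AGP-style rows above describe two PacBio contigs on CHR1 separated by a sized gap. The sketch below parses them and checks that each row's length matches its coordinates and that the rows abut. It assumes the abbreviated column layout shown on the slide (W rows: contig accession then length; N rows: gap length then the word "gap"), which is shorter than a full AGP record. The class and function names are hypothetical.

```python
from typing import NamedTuple

# Rows as they appear on the slide (abbreviated AGP-style layout, assumed).
ROWS = """\
CHR1 46746040 46857004 40 W LBHZ01000938.1 110965
CHR1 46857005 47034202 41 N 177198 gap
CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493
"""


class Row(NamedTuple):
    obj: str
    start: int
    end: int
    part: int
    kind: str    # "W" = sequenced contig, "N" = gap
    label: str   # contig accession, or "gap"
    length: int


def parse_row(line):
    f = line.split()
    obj, start, end, part, kind = f[0], int(f[1]), int(f[2]), int(f[3]), f[4]
    if kind == "N":
        length, label = int(f[5]), f[6]   # gap rows: length comes first
    else:
        label, length = f[5], int(f[6])   # contig rows: accession comes first
    return Row(obj, start, end, part, kind, label, length)


rows = [parse_row(line) for line in ROWS.splitlines()]

for row in rows:
    # Coordinates are 1-based and inclusive, so span = end - start + 1.
    assert row.end - row.start + 1 == row.length, row

for left, right in zip(rows, rows[1:]):
    assert right.start == left.end + 1, (left, right)

gap = next(r for r in rows if r.kind == "N")
print(f"gap of {gap.length} bp between {rows[0].label} and {rows[2].label}")
```

The checks pass on the slide's values: the gap spans 177,198 bp and sits directly between the two contigs.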

This region is rich in medically relevant genes

[Figure: chr1 (p33) ideogram with SegDups and Genes tracks (CYP4Z2P, CYP4A11, CYP4X1, CYP4Z1, CYP4A22) and CHM13 PacBio contigs LBHZ010000938.1 (shown twice) and LBHZ010000245.1]

This locus has an assigned GRC issue due to unresolved variation and may be a candidate locus for alternative representation in the reference

Reference based Analyses

•  100X Illumina sequence from CHM13
•  Align to GRCh37 and GRCh38 with BWA-MEM
•  Variant calling via SpeedSeq (Chiang et al, 2015)
•  SNVs, indels: FreeBayes
•  SVs: LUMPY, SVTyper
•  CNV: CNVnator

Similar number of variants per chromosome

[Figure: variants per chromosome; panels: GRCh37.p15, GRCh38.p2]

Similar annotation of variants

[Figure: variant annotation; panels: GRCh37.p15, GRCh38.p2]

SRGAP2 region resolved in GRCh38

[Figure: Patch alignment to chromosome 1 (1q32, 1q21, 1p21); panels: GRCh37.p15, GRCh38.p2]

PRIM2 region resolved in GRCh38

tl;dpa*

•  The reference genome assembly is constantly being improved
•  New PacBio-based assemblies are orders of magnitude more contiguous than previous WGS assemblies
•  Integration of other data (e.g. BioNano, Dovetail) can improve contiguity even further and be used to identify structurally variant haplotypes that can be added to the reference as alternative loci
•  Platinum genome sequences integrated into GRCh38 have greatly improved read mapping and variant calling

*too long; didn’t pay attention

Acknowledgements

The McDonnell Genome Institute at Washington University in St. Louis: Rick Wilson, Bob Fulton, Wes Warren, Tina Graves-Lindsay, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Aye Wollam

The Finishing and Bioinformatics Teams at The Genome Institute

University of Washington: Evan Eichler, John Huddleston, Archana Raja

NCBI: Valerie Schneider

University of Pittsburgh School of Medicine (CHM13 cell line): Urvashi Surti

Personalis: Deanna Church

BioNano Genomics: Palak Sheth

Pacific Biosciences: Jason Chin, Nick Sisneros