150224 giab 30 min generic slides


Genome in a Bottle: So you’ve sequenced a genome – how well did you do?

February 2015

Justin Zook, Marc Salit, and the Genome in a Bottle Consortium

Whole genome sequencing technologies disagree about hundreds of thousands of variants

[Figure: Venn diagram of SNP calls from Platform #1, Platform #2, and Platform #3. Region labels give # SNPs (% of SNPs detected by any platform): 3,198,316 (80.05%); 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%); 208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%)]
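The percentages in the Venn diagram can be rechecked from the counts alone. The following is a minimal sketch in Python; the variable names are illustrative, and treating the largest region as the one where all three platforms agree is an assumption that the extracted slide text does not confirm.

```python
# SNP counts for the seven Venn regions, as listed on the slide.
region_counts = [3_198_316, 125_574, 230_311, 121_440, 208_038, 71_944, 39_604]

# Total number of SNPs detected by any platform.
total = sum(region_counts)

# Share of the total in each region, rounded to two decimals as on the slide.
percentages = [round(100 * count / total, 2) for count in region_counts]

# ASSUMPTION: the largest region is the three-way intersection (all platforms agree).
concordant = max(region_counts)
discordant = total - concordant

print(f"total SNPs: {total}")
print(f"percent per region: {percentages}")
print(f"SNPs not in the largest region: {discordant}")
```

Under that assumption, the SNPs outside the largest region number in the hundreds of thousands, which is the point of the slide title.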

Bioinformatics programs also disagree

O’Rawe et al. Genome Medicine 2013, 5:28

NIST-hosted Genome in a Bottle Consortium

• Infrastructure for performance assessment of NGS
– support science-based regulatory oversight
• No widely accepted set of metrics to characterize the fidelity of variant calls from NGS…
• Genome in a Bottle Consortium is developing standards to address this…
– well-characterized human genomes as Reference Materials (RMs)
  • characterized and disseminated by NIST
– tools and methods to use these RMs
  • Global Alliance for Genomics and Health Benchmarking Team

http://genomeinabottle.org

Genome in a Bottle Consortium Development

• NIST met with sequencing technology developers to assess standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small, invitational workshop at NIST to develop consortium for human genome reference materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others
– developed draft work plan
– April 2012
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
• Website
– www.genomeinabottle.org

Others working in this space…

Well-characterized genomes

• Illumina Platinum Genomes

• CDC GeT-RM

• Korean Genome Project

• Human Longevity, Inc.

• Hydatidiform mole haploid cell line

• Genome Reference Consortium

Performance Metrics

• Global Alliance for Genomics and Health Benchmarking Team

• NCBI/CDC GeT-RM Browser

• GCAT website

NIST Plays a Role in the First FDA Authorization for Next-Generation Sequencer

November 20, 2013

Measurement Process

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials will be developed to characterize performance of a part of process
– materials will be certified for their variants against a reference sequence, with confidence estimates

[Figure labels: generic measurement process; Pre-Analytical steps; Analytical steps; Clinical Interpretation]

Putting “Genomes” in Bottles

• NIST worked with GIAB to select genomes
• Current genomes
– NA12878 HapMap sample as Pilot sample
  • part of 17-member pedigree
– 2 trios from PGP
  • Ashkenazim
  • Asian

[Figure: CEPH Utah Pedigree 1463. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. 11 children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893]

NIST Human Genome RMs in the pipeline

• All 10 µg samples of DNA isolated from multistage large growth cell cultures
– all are intended to act as stable, homogeneous references suitable for use in regulated applications
– all genomes also available from Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet planned as NIST RM

Goals for Data to Accompany RM

• ~0 false positive AND false negative calls in confident regions

• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)

• Avoid bias towards any particular platform
– take advantage of strengths of each platform

• Avoid bias towards any particular bioinformatics algorithms

Pilot Genome: Integrate 14 Datasets from 5 platforms

Integration Methods to Establish Benchmark Variant Calls

[Figure: histograms of Annotation #1 (e.g., coverage) and Annotation #2 (e.g., strand bias) for Dataset #1, Dataset #2, and Dataset #3, with Site A, Site B, and Site C marked and a “Potential Bias” region indicated]

Dataset       Site A   Site B   Site C
Dataset #1    0/0      0/0      1/1
Dataset #2    0/1      0/1      1/1
Dataset #3    0/0      0/1      1/1
Integration   0/0      0/1      Uncertain

Steps: Candidate variants; Concordant variants; Find characteristics of bias; Arbitrate using evidence of bias; Confidence Level

Zook et al., Nature Biotechnology, 2014.

Assigning confidence to genotypes

High-confidence sites

• Sequencing/bioinformatics methods agree or we understand the biases causing disagreement

• At least some methods have no evidence of bias

• Inherited as expected

Less confident sites

• In a region known to be difficult for current technologies

• State reasons for lower confidence

• If a site is near a low confidence site, make it low confidence
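The rules above can be illustrated with a small Python sketch. This is not the published GIAB integration method; it is a simplified illustration under stated assumptions. The genotypes are those in the Dataset/Site table on the integration slide, while the per-dataset bias flags, the function names, and the 50 bp window are hypothetical values chosen for the example. The "inherited as expected" check is not modeled.

```python
UNCERTAIN = "uncertain"


def integrate_site(calls):
    """Arbitrate one site from (genotype, has_bias_evidence) pairs, one per dataset."""
    unbiased = [genotype for genotype, biased in calls if not biased]
    if not unbiased:
        # No method is free of evidence of bias.
        return UNCERTAIN
    if len(set(unbiased)) > 1:
        # Disagreement that bias evidence does not explain.
        return UNCERTAIN
    return unbiased[0]


def propagate_low_confidence(sites, window=50):
    """Mark any site within `window` bp of an uncertain site as uncertain.

    `sites` is a list of (position, genotype); `window` is a hypothetical value.
    """
    uncertain_positions = [pos for pos, genotype in sites if genotype == UNCERTAIN]
    return [
        (pos, UNCERTAIN if any(abs(pos - u) <= window for u in uncertain_positions) else genotype)
        for pos, genotype in sites
    ]


# Genotypes from the slide table; the True/False bias flags are ASSUMED for illustration.
site_a = [("0/0", False), ("0/1", True), ("0/0", False)]
site_b = [("0/0", True), ("0/1", False), ("0/1", False)]
site_c = [("1/1", True), ("1/1", True), ("1/1", True)]

integrated = [integrate_site(site) for site in (site_a, site_b, site_c)]
print(integrated)

neighbors = propagate_low_confidence([(100, "0/1"), (120, UNCERTAIN), (500, "1/1")])
print(neighbors)
```

With these assumed flags the sketch reproduces the Integration row of the table: Site C stays uncertain even though all datasets agree, because no dataset is free of bias evidence.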

Challenges with assessing performance

• All variant types are not equal

• All regions of the genome are not equal

• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance

• Genotypes fall in 3+ categories (not positive/negative)

– standard diagnostic accuracy measures not well posed


Challenge in variant comparison: Complex variants have multiple correct representations

[Figure: alignments of the same region against the reference (Ref) from BWA, ssaha2, CGTools, and Novoalign; labels: T insertion, TCTCT insertion]

                              FP SNPs       FP MNPs      FP indels
Traditional comparison        0.38% (610)   100% (915)   6.5% (733)
Comparison with realignment   0.15% (249)   4.2% (38)    2.6% (298)
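One way to see why record-by-record comparison overcounts false positives is to compare the haplotypes that two call sets produce instead of the records themselves. The Python sketch below illustrates that idea only; it is not the realignment-based comparison behind the numbers above, and the reference sequence, variants, and function names are made up for the example.

```python
def apply_variants(reference, variants):
    """Return the haplotype produced by applying non-overlapping variants.

    Each variant is (position, ref_allele, alt_allele) with a 0-based position.
    """
    sequence = reference
    # Apply right-to-left so earlier coordinates are not shifted by indels.
    for position, ref_allele, alt_allele in sorted(variants, reverse=True):
        end = position + len(ref_allele)
        if sequence[position:end] != ref_allele:
            raise ValueError(f"reference allele mismatch at {position}")
        sequence = sequence[:position] + alt_allele + sequence[end:]
    return sequence


def same_haplotype(reference, calls_a, calls_b):
    """True if two call sets describe the same resulting sequence."""
    return apply_variants(reference, calls_a) == apply_variants(reference, calls_b)


reference = "ACGTCTCTCTGA"  # hypothetical reference with a CT repeat

# The same CT insertion placed at the left or right end of the repeat.
left_aligned = [(3, "T", "TCT")]
right_aligned = [(9, "T", "TCT")]

# One MNP versus two adjacent SNPs.
mnp = [(1, "CG", "TA")]
two_snps = [(1, "C", "T"), (2, "G", "A")]

print(left_aligned == right_aligned, same_haplotype(reference, left_aligned, right_aligned))
print(mnp == two_snps, same_haplotype(reference, mnp, two_snps))
```

In both pairs the records differ but the haplotypes match, so a naive comparison would report a false positive and a false negative for what is the same variant.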

Global Alliance for Genomics and Health Benchmarking Task Team

• Formed June 2014 to develop methods and tools for comparing variant calls to a benchmark

• Developed standardized definitions for performance metrics like TP, FP, and FN.

• Initial focus on germline SNPs/indels
• Developing benchmarking tools
– Comparison engine
– Pluggable web interface with modules for:
  • Reporting/calculation of metrics
  • Visualization/user interface

• Working with Genome in a Bottle Consortium to host data and calls from their well-characterized genomes
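As a rough illustration of what a comparison engine computes, the Python sketch below counts TP, FP, and FN by exact matching of calls against a benchmark. It is not the Benchmarking Task Team's standardized definitions, which the slides do not spell out. It assumes both call sets are already restricted to high-confidence regions and use identical variant representations, and it treats a genotype mismatch as both an FP and an FN, which is only one possible convention. All names and example calls are hypothetical.

```python
def benchmark(query_calls, truth_calls):
    """Compare two sets of (chrom, pos, ref, alt, genotype) tuples by exact match."""
    tp = len(query_calls & truth_calls)
    fp = len(query_calls - truth_calls)
    fn = len(truth_calls - query_calls)
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        # Guard against empty call sets to avoid dividing by zero.
        "precision": tp / (tp + fp) if tp + fp else None,
        "sensitivity": tp / (tp + fn) if tp + fn else None,
    }


# Hypothetical benchmark and query calls.
truth = {
    ("1", 100, "A", "G", "0/1"),
    ("1", 200, "C", "T", "1/1"),
    ("1", 300, "G", "A", "0/1"),
}
query = {
    ("1", 100, "A", "G", "0/1"),  # matches the benchmark
    ("1", 200, "C", "T", "0/1"),  # right site and allele, wrong genotype
    ("1", 400, "T", "C", "0/1"),  # not in the benchmark
}

result = benchmark(query, truth)
print(result)
```

The genotype-mismatch call shows why the counting convention matters: the same call could instead be scored as a partial match, which would change both metrics.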

www.bioplanet.com/gcat

Example User Interface

Stratifying Performance

• Measure performance for different types of variants in different sequence contexts
– Types of variants
  • SNPs
  • indels of different sizes
  • complex variants
  • structural variants
– Sequence contexts
  • Homopolymers
  • STRs
  • Duplications
– Functional context
  • Exome vs genome, etc.
– Data characteristics
  • Coverage
  • Mapping quality
• Challenge of smaller gene panels vs genome sequencing
– one RM may not have a sufficient number of examples of different classes of variants or sequence contexts
– likely need more samples with specific types of variants
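Stratification can be sketched as grouping benchmark variants by category before computing a metric. The Python example below is a minimal illustration; the stratum labels, variant identifiers, and function name are hypothetical, and a real tool would derive strata from variant type and genomic annotations.

```python
from collections import defaultdict


def stratified_sensitivity(truth_variants, detected_ids):
    """Return sensitivity per stratum.

    `truth_variants` is a list of (variant_id, stratum) pairs from the benchmark;
    `detected_ids` is the set of benchmark variant ids found by the pipeline.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [found, total]
    for variant_id, stratum in truth_variants:
        counts[stratum][1] += 1
        if variant_id in detected_ids:
            counts[stratum][0] += 1
    return {stratum: found / total for stratum, (found, total) in counts.items()}


# Hypothetical benchmark variants with their strata.
truth_variants = [
    ("v1", "SNP"),
    ("v2", "SNP"),
    ("v3", "SNP"),
    ("v4", "indel in homopolymer"),
    ("v5", "indel in homopolymer"),
]
detected_ids = {"v1", "v2", "v3", "v4"}

per_stratum = stratified_sensitivity(truth_variants, detected_ids)
overall = len(detected_ids) / len(truth_variants)
print(per_stratum, overall)
```

The overall value hides the weaker stratum, and with only two homopolymer indels the per-stratum estimate is very coarse, which mirrors the point about needing enough examples of each class.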

NCBI/CDC GeT-RM Browser

• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/

• Allows visualization of questionable calls

Initial uses of high-confidence NIST-GIAB genotypes for NA12878

• NIST has released several versions of high-confidence genotypes for its pilot RM
• These data are presently being used for benchmarking
– prior to release of RMs
– SNPs & indels
  • ~77% of the genome

Using Genome in a Bottle calls to benchmark clinical exome sequencing at Mount Sinai School of Medicine

“We evaluate a set of NA12878 technical replicates against GIAB for each new pipeline version.”

Benchmarking somatic variant calling at Qiagen

Implications of Technical Accuracy in Medical Genome Sequencing

• Collaboration with Euan Ashley group at Stanford
• What is accuracy for functional variants?
• How much of the exome falls in high-confidence regions?
• “Black list” in databases
• Sensitivity – WExS (95%) < WGS (98%)
  • especially splicing
– genome < nonsyn < syn
– Most exome FNs caused by low coverage
– Most WGS FNs caused by filtering
• Only 81% of ClinVar pathogenic or likely pathogenic SNPs fall in high-confidence regions
– Lots of work to do!

Overview of NIST RM Development

Timeline columns: Q4 2014, Q1 2015, Q2 2015, Q3 2015, Q4 2015. Milestones per genome, in the order shown on the slide:

HG-001/NA12878 (“Pilot” Genome)
– Release NIST RM8398; Preliminary large deletions
– Refined Structural Variants

HG-002 to HG-004 (Ashkenazim trio)
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– Preliminary SNPs/indels; 120x-150x PacBio data; “moleculo”; mate-pair; CG-LFR
– Refined SNPs/indels; Preliminary SVs
– Refined Structural Variants
– NIST RMs 8391/8392 release

HG-005 (son in Asian trio)
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– “moleculo”; mate-pair; CG-LFR
– Preliminary SNPs/indels
– Refined SNPs/indels; Refined Structural Variants
– NIST RM8393 release

Ashkenazim Jewish PGP RM Trio

Columns: Dataset; Characteristics; Coverage; Availability; Good for…

• Illumina Paired-end; 150x150bp; ~300x/individual; Fastq on ftp; SNPs/indels/some SVs
• Illumina Long Mate pair; ~6000 bp insert; ~40x/individual; Feb-Mar 2015; SVs
• Illumina “moleculo”; Custom library; ~30x by long fragments; Feb-Mar 2015; SVs/phasing/assembly
• Complete Genomics; 100x/individual; On ftp; SNPs/indels/some SVs
• Complete Genomics; LFR; ??; SNPs/indels/phasing
• Ion Proton; Exome; 1000x/individual; On SRA; SNPs/indels in exome
• BioNano Genomics; Feb 2015; SVs/assembly
• PacBio; ~10kb reads; ~120-150x on AJ trio; Finished ~Mar 2015; SVs/phasing/assembly/STRs

Asian PGP trio

• Similar sequencing to Ashkenazim trio except for PacBio

• Only son will be NIST RM

Future Directions

Germline mutations

• Difficult regions/variants
– Long-read technologies
– Forming an analysis group
• Tools for assessing performance
– How to stratify performance and understand biases?

Somatic mutations

• Pilot interlaboratory study to assess comparability of spike-ins

• Commercial members developing FFPE cell lines

• Participants interested in mixing different RMs

How to get involved

• Use our integrated SNP/indel genotypes for NA12878 and give us feedback
– Cells and DNA currently available from Coriell
– NIST RM available April 2015
• Join our new Analysis group
– Use Long-read technologies
– Structural Variant calls
– De novo assembly
– Help create the best-ever characterized trio
• Attend our biannual workshops (January in CA, August in MD)
• Develop tools/metrics with Global Alliance for Genomics and Health Benchmarking Team

Acknowledgments

• FDA – Elizabeth Mansfield, HPC staff

• HSPH

• GCAT - David Mittelman, Jason Wang

• Francisco De La Vega

• Illumina - Mike Eberle

• Personalis - Deanna Church

• NCBI – Chunlin Xiao

• Celera - Andrew Grupe

• Genome in a Bottle
– www.genomeinabottle.org

– New members welcome!

– Sign up for email newsletters

– jzook@nist.gov