genomeinabottle.org Genome in a Bottle Consortium August 2015 NIST, Gaithersburg, MD Reference Materials for Clinical Applications of Human Genome Sequencing Marc Salit, Ph.D. and Justin Zook, Ph.D. National Institute of Standards and Technology

Giab aug2015 intro and update 150821.pptx


Page 1: Giab aug2015 intro and update 150821.pptx

genomeinabottle.org

Genome in a Bottle Consortium August 2015

NIST, Gaithersburg, MD

Reference Materials for Clinical Applications of Human Genome Sequencing

Marc Salit, Ph.D. and Justin Zook, Ph.D.

National Institute of Standards and Technology

Page 2: Giab aug2015 intro and update 150821.pptx

NIST Released the GIAB Pilot Genome as RM 8398 in May 2015

Page 3: Giab aug2015 intro and update 150821.pptx

GIAB Scope

• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.

• A principal motivation for this consortium is to enable performance assessment of sequencing and science-based regulatory oversight of clinical sequencing.

Page 4: Giab aug2015 intro and update 150821.pptx

Genome in a Bottle Consortium Development

• NIST met with sequencing technology developers to assess standards needs
– Stanford, June 2011

• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011

• Small workshop at NIST to develop consortium for human genome reference materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others
– developed draft work plan
– April 2012

• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
– August 2015 at NIST
– January 28-29, 2016 at Stanford

• Website
– www.genomeinabottle.org

Page 5: Giab aug2015 intro and update 150821.pptx

Well-characterized, stable RMs

• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
– integration of data from multiple platforms
– sequencing and analysis
• Enable regulated applications

[Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]

Page 6: Giab aug2015 intro and update 150821.pptx

NGS Validation Process using Genomes in Bottles

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

Analytical Process

Genome in a Bottle Scope

Pre-Analytical Process

Clinical Interpretation

GIAB Data

Page 7: Giab aug2015 intro and update 150821.pptx

Genome in a Bottle Consortium (GIAB)

Hosted by US National Institute of Standards and Technology

Goal: Provide infrastructure for performance assessment of NGS

• Appropriately consented, widely available DNA samples, distributed by the Coriell Institute
– Also, QCed Reference Material (RM) versions from controlled lots will be available from NIST
– Pilot NIST RM 8398: tinyurl.com/giabpilot

• High-accuracy reference data for these samples

• Tools to facilitate their use
– With the Global Alliance Data Working Group Benchmarking Team

ga4gh.org

Page 8: Giab aug2015 intro and update 150821.pptx

High-confidence SNP/indel calls

Zook et al., Nature Biotechnology, 2014.

• methods to develop SNP/indel call set described in manuscript

• broad and quick adoption of call set for benchmarking
– struck a nerve

Page 9: Giab aug2015 intro and update 150821.pptx

Highlights

This workshop
• Progress Update
• Breakouts
– Analyses for PGP GIAB Trios
– Other RMs
• GIAB Roadmap
– Coordinating analyses
– Other RM plans
– Papers?
• Using GIAB Products for analytical validation of clinical NGS assays

Future GIAB work
• Beyond support, improvement/development and maintenance of existing GIAB products…
– What future work should GIAB do that would uniquely take advantage of the momentum we’ve built?

Page 10: Giab aug2015 intro and update 150821.pptx

Agenda

Thursday
• Welcome and Status Update
• Break
• Breakout presentations
– Analysis Team
– Other Reference Materials
• Lunch (on your own in cafeteria)
• GIAB Roadmap
• Break
• Breakouts to plan to carry out the roadmap
• Plenary to discuss Roadmap plans

Friday
• Additional Analysis breakout if needed
• Using GIAB products for Analytical Validation
• Break
• GIAB products for analytical validation?
• Lunch (on your own in cafeteria)
• Steering committee meeting

Page 11: Giab aug2015 intro and update 150821.pptx

Agenda

Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy' variants and regions of the genome
– Topic #2: Selecting future genomes for Reference Materials

Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot Reference Material
• Discussion of plans to release pilot Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans and discussion
• Steering committee Overview
• First meeting of the Steering Committee (others adjourn)

Please Note

Slides will be made available on SlideShare after the workshop (see genomeinabottle.org).

Tweets are welcome unless the speaker requests otherwise. Please use #giab as the hashtag.

Page 12: Giab aug2015 intro and update 150821.pptx

GIAB Roadmap: Where are we, Where are we going?

• Reference Materials
– Germline
– Somatic

• Informatics
– Analysis of GIAB data
– Benchmarking

• Documentary Standards/Publications
– Documentation of methods
– Supporting Use

Page 13: Giab aug2015 intro and update 150821.pptx

GIAB

Germline Genomes
• Pilot RM: High-confidence SNPs/indels, RM Release, High-confidence SVs
• PGP RMs: High-confidence SNPs/indels, RM Release, High-confidence SVs
• Other ancestries: Do we need trios? Other large families?
• Sample panels: Many samples with clinically important mutations, Pharmacogenomics
• In depth analyses: Characterize harder parts of the genome, Diploid de novo assemblies, Assign confidence scores to variants in RMs

Somatic mutation RMs
• Interlaboratory study
• ctDNA/cfDNA/fetal DNA
• Whole cancer genomes

Benchmarking tools
• Define performance metrics
• Stratification - Assign confidence to types of variants

Documents/Publications
• Analyses
• Best practices/analytic validation
• Documentary standards

Page 14: Giab aug2015 intro and update 150821.pptx

Others working in this space…

Well-characterized genomes
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Human Longevity, Inc.
• Hydatidiform mole haploid cell line
• Genome Reference Consortium
• 1000 Genomes SV group

Performance Metrics
• Global Alliance for Genomics and Health Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website

Page 15: Giab aug2015 intro and update 150821.pptx

What should GIAB do?

• Beyond support, improvement/development and maintenance of existing in-process GIAB products…
– What future work should GIAB do that would take advantage of the momentum and unique community we’ve built?

Page 16: Giab aug2015 intro and update 150821.pptx

GIAB Progress Update

August 2015

Page 17: Giab aug2015 intro and update 150821.pptx

NIST Human Genome Reference Materials (RMs)

• NIST RM 8398 is available!
– tinyurl.com/giabpilot
– DNA isolated from large growth cell cultures
– Stable, homogeneous
– Best for regulated uses
– DNA from same cell line at Coriell (NA12878)

• New AJ and Asian Samples
– Available from Coriell now
– NIST RM available in 2016

Page 18: Giab aug2015 intro and update 150821.pptx

Using high-confidence NIST-GIAB genotypes for NA12878

• NIST has released several versions of high-confidence genotypes for its pilot RM

• These data are presently being used for benchmarking
– prior to release of RMs
– SNPs & indels

• ~77% of the genome

• Data on FTP now well-organized
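
A figure such as "~77% of the genome" is a ratio of bases inside the high-confidence regions to total genome bases. The sketch below shows one way such a fraction could be computed from a BED file, assuming the intervals do not overlap. The toy intervals and the toy genome length are invented for illustration and are not GIAB data.

```python
import io


def bed_covered_bases(lines) -> int:
    """Sum interval lengths from BED lines (0-based start, end-exclusive).

    Assumes intervals do not overlap; overlapping intervals would be
    double counted and should be merged first.
    """
    total = 0
    for line in lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        _chrom, start, end = line.split()[:3]
        total += int(end) - int(start)
    return total


# Toy input (hypothetical): two intervals on a 1000-base "genome".
toy_bed = io.StringIO("chr21\t0\t700\nchr21\t800\t870\n")
toy_genome_length = 1000

fraction = bed_covered_bases(toy_bed) / toy_genome_length
print(f"{fraction:.0%} of the toy genome is inside the regions")
```

The choice of denominator (for example, whether reference gaps are counted) changes the percentage, so it is worth stating alongside any reported fraction.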

Page 19: Giab aug2015 intro and update 150821.pptx

Page 20: Giab aug2015 intro and update 150821.pptx

GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of data underlying each call

Page 21: Giab aug2015 intro and update 150821.pptx

Uses of GIAB NA12878

Oncology – Molecular and Cellular Tumor Markers

“Next Generation” Sequencing (NGS) guidelines for somatic genetic variant detection

www.bioplanet.com/gcat

Page 22: Giab aug2015 intro and update 150821.pptx

Global Alliance for Genomics and Health Benchmarking Task Team

• Formed June 2014 to develop methods and tools for comparing variant calls to a benchmark

• Developed standardized definitions for performance metrics like TP, FP, and FN.

• Initial focus on germline SNPs/indels

• Developing benchmarking tools
– Comparison engine
– Pluggable web interface with modules for:
– Reporting/calculation of metrics
– Visualization/user interface

• Working with Genome in a Bottle Consortium to host data and calls from their well-characterized genomes

www.bioplanet.com/gcat

Example User Interface
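
To make the TP, FP, and FN terminology concrete, here is a minimal sketch of counting them against a benchmark and deriving precision and recall. It is not the Benchmarking Task Team's comparison engine: it assumes variants are already normalized so that an exact match on (chrom, pos, ref, alt) is meaningful, and the function names and toy variants are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # benchmark variants also present in the query call set
    fp: int  # query variants absent from the benchmark
    fn: int  # benchmark variants missing from the query call set


def compare(truth: set, query: set) -> Counts:
    """Count TP/FP/FN by exact match of (chrom, pos, ref, alt) keys."""
    return Counts(
        tp=len(truth & query),
        fp=len(query - truth),
        fn=len(truth - query),
    )


def precision(c: Counts) -> float:
    """Fraction of query calls that are in the benchmark."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    """Fraction of benchmark calls recovered by the query (sensitivity)."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


# Toy variants, invented for illustration.
truth = {("chr21", 100, "A", "G"), ("chr21", 200, "C", "T"), ("chr21", 300, "G", "GA")}
query = {("chr21", 100, "A", "G"), ("chr21", 300, "G", "GA"), ("chr21", 400, "T", "C")}

counts = compare(truth, query)
print(counts, precision(counts), recall(counts))
```

Exact matching breaks down when the same variant can be written in more than one way, which is one reason a dedicated comparison engine is needed.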

Page 23: Giab aug2015 intro and update 150821.pptx

Global Alliance for Genomics and Health Benchmarking Task Team

Credit: Rebecca Truty, Complete Genomics

How should we interpret this complex variant on chr21?

Page 24: Giab aug2015 intro and update 150821.pptx

Global Alliance for Genomics and Health Benchmarking Task Team

Credit: Rebecca Truty, Complete Genomics

Beyond simple T/F classification: Genotype errors

Truth | Callset | Description | Proposed Name(s) | CM#1 region match | CM#2 allele match | CM#3 genotype match
0/1 | 1/1 | zygosity/genotype error | GE | TP | 1TP, 1GE | FN
1/1 | 0/1 | zygosity/genotype error | GE | TP | 1TP, 1GE | FN
1/2 | 0/1, 1/1, 0/2, 2/2 | common allele, FN allele | GE_FN | TP | 1TP, 1GE, 1FN | FN
0/1 | 1/2 | common allele, FP allele | GE_FP | TP | 1TP, 1GE, 1FP | FP, FN
1/1 | 1/2 | common allele, FP allele | GE_FP | TP | 1TP, 1GE, 1FP | FP, FN
1/2 | 1/3 | common allele, FP allele, FN allele | GE_FP_FN | TP | 1TP, 1GE, 1FP, 1FN | FP, FN

Page 25: Giab aug2015 intro and update 150821.pptx

Global Alliance for Genomics and Health Benchmarking Task Team

Credit: Rebecca Truty, Complete Genomics

Beyond simple T/F classification: no-calls and half-calls

Truth | Callset | Description | Proposed Name(s) | CM#1 region match | CM#2 allele match | CM#3 genotype match
0/1 | ./1 | half-call, TP allele | HC_TP | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE | TP
1/1 | ./1 | half-call, TP allele | HC_TP | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE | FN
0/1, 1/1 | ./0 | half call, FN allele(s) | HC_FN | NC, NCV, TP | 1NC, 1NCV, 1FN | FN
1/2 | ./0 | half call, FN allele(s) | HC_FN | NC, NCV, TP | 1NC, 2NCV, 2FN | FN
1/2 | ./1, ./2 | half-call, TP allele, FN allele | HC_TP_FN | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE, 1FN | FN
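
The proposed names on this slide and the previous one appear to follow from comparing the sets of non-reference alleles in the truth and callset genotypes. The sketch below encodes that reading. The rule is inferred from the listed truth/callset pairs rather than taken from a published definition, and the `MATCH` and `OTHER` labels are hypothetical placeholders for cases the slides do not cover.

```python
def _alleles(genotype: str) -> list:
    """Split a VCF-style genotype such as '0/1' or '1|2' into allele strings."""
    return genotype.replace("|", "/").split("/")


def _alt_set(genotype: str) -> set:
    """Non-reference, non-missing alleles of a genotype."""
    return {a for a in _alleles(genotype) if a not in ("0", ".")}


def proposed_name(truth: str, call: str) -> str:
    """Name a truth/callset genotype pair using the categories on the slides."""
    call_alleles = _alleles(call)
    half_call = "." in call_alleles
    if not half_call and sorted(call_alleles) == sorted(_alleles(truth)):
        return "MATCH"  # hypothetical label: the slides list mismatches only
    t, c = _alt_set(truth), _alt_set(call)
    if not half_call and not (t & c):
        return "OTHER"  # hypothetical label: no shared allele, not on the slides
    parts = ["HC" if half_call else "GE"]
    if half_call and (t & c):
        parts.append("TP")  # the called allele of the half-call is in the truth
    if c - t:
        parts.append("FP")  # callset has an allele the truth lacks
    if t - c:
        parts.append("FN")  # truth has an allele the callset lacks
    return "_".join(parts)


for truth_gt, call_gt in [("0/1", "1/1"), ("1/2", "0/1"), ("0/1", "1/2"),
                          ("1/2", "1/3"), ("0/1", "./1"), ("1/2", "./0"),
                          ("1/2", "./2")]:
    print(truth_gt, call_gt, proposed_name(truth_gt, call_gt))
```

Only the name column is reproduced here; the CM#1 to CM#3 counts in the tables depend on the comparison method and are not derived by this sketch.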

Page 26: Giab aug2015 intro and update 150821.pptx

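Stratifying False Positives

[Figure: false positives stratified by GC content and by tandem repeat (TR) unit; panels labeled TR Unit 1, TR Unit 2, TR Unit 3, TR Unit 4, TR Unit <7, TR Unit >=7]

Credit: Abby Beeler, Ellie Wood

GA4GH - Stratification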

Page 27: Giab aug2015 intro and update 150821.pptx

Data from GIAB PGP Trios

Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end | 150x150bp | ~300x/individual | on SRA/FTP | SNPs/indels/some SVs
Illumina Long Mate pair | ~6000 bp insert | ~20x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~20-30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics | | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | LFR | | on SRA/FTP | SNPs/indels/phasing
Ion Proton | Exome | 1000x/individual | on SRA/FTP | SNPs/indels in exome
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | Raw reads and assemblies on FTP | SVs/assembly
10X | Linked reads | 30-45x/individual | on FTP | SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs

Page 28: Giab aug2015 intro and update 150821.pptx

GIAB Analysis Group – New Data Sets

Leaders
• Francisco de la Vega
– Annai Systems
• Chris Mason
– Weill Cornell Medical Center
• Tina Graves
– Washington University
• Valerie Schneider
– NCBI
• and Justin and Marc

Status
• Analysis Group Responsibilities:
– https://docs.google.com/document/d/10eA0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXHhtNH1uzw/edit?usp=sharing
• Analysis Milestones:
– https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
• Analysis Methods:
– https://docs.google.com/spreadsheets/d/1Je2g85H7oK6kMXbBOoqQ1FMNrvGnFuUJTJn7deyYiS8/edit?usp=sharing
• Analysis Plan:
– https://drive.google.com/file/d/0B7Ao1qqJJDHQdnVEaVdqbWdEdkE/view?usp=sharing
• Collecting Data into a Central FTP Site
• Recruiting people to help with the work. This could be you. We need volunteers!

Goal: Establish and distribute a set of authoritative benchmark variant calls of all types and sizes, as well as homozygous reference regions, on GIAB PGP trios

Page 29: Giab aug2015 intro and update 150821.pptx

Data Release Plans: Real-time, Open, Public Release

Individual Datasets
• Uploaded to GIAB FTP site as it is collected
• Includes raw reads, aligned reads, and variant/reference calls

Integrated High-confidence Calls
• First develop SNP, indel, and homozygous reference calls
• Then develop SV and non-SV calls
• Released calls are versioned
• Preliminary callsets will be made available to be critiqued

Page 30: Giab aug2015 intro and update 150821.pptx

SNP/Indel Integration Method Update

• Implementing refined integration methods on DNAnexus
– Others can readily reproduce results
– Consistent results for all GIAB genomes

• Validating with released NA12878 RM data
– Planned completion Sep 2015

• Then, apply to PGP trios
– Plan to analyze AJ trio by Nov 2015
– Release of NIST RMs in early 2016

Page 31: Giab aug2015 intro and update 150821.pptx

Integration to form high-confidence SNP/indel calls

VCFs with 0 FP PASS and 0 FN PASS+filtered in BED files

• If 1+ datasets PASS and all PASSing datasets have same genotype
– High-confidence variant, include in high-confidence regions

• If all datasets are filtered or outside BED
– Not high-confidence unless alignments are manually inspected; exclude ±50 bp from high-confidence regions

• If PASSing datasets disagree about genotype or variant
– Not high-confidence unless alignments are manually inspected; exclude ±50 bp from high-confidence regions

• If inside BED and not in VCF for 1+ datasets, and no datasets have PASSing variants
– High-confidence region
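
The four rules above can be read as a small decision function applied at each candidate site. The sketch below is one literal reading of the slide, not the NIST integration pipeline: the `Evidence` fields and outcome strings are hypothetical, a dataset counts as PASSing only inside its own BED, genotypes are compared as plain strings, and the manual-inspection override is left out.

```python
from dataclasses import dataclass
from typing import List, Optional

HC_VARIANT = "high-confidence variant"
HC_REGION = "high-confidence region"
NOT_HC = "not high-confidence: exclude +-50 bp"


@dataclass(frozen=True)
class Evidence:
    """One dataset's view of a candidate site (field names are hypothetical)."""
    in_bed: bool                    # site is inside this dataset's BED file
    status: Optional[str]           # "PASS", "FILTERED", or None if not in VCF
    genotype: Optional[str] = None  # e.g. "0/1"; None when status is None


def integrate(evidence: List[Evidence]) -> str:
    """Apply the slide's rules to all datasets' evidence at one site."""
    passing = [e for e in evidence if e.in_bed and e.status == "PASS"]
    if passing:
        if len({e.genotype for e in passing}) == 1:
            return HC_VARIANT  # 1+ PASS, all PASSing genotypes agree
        return NOT_HC          # PASSing datasets disagree
    # From here on, no dataset has a PASSing variant.
    if any(e.in_bed and e.status is None for e in evidence):
        return HC_REGION       # inside BED and absent from VCF for 1+ datasets
    return NOT_HC              # every dataset is filtered or outside its BED


site = [Evidence(True, "PASS", "0/1"), Evidence(True, "FILTERED", "0/1")]
print(integrate(site))
```

Sites that come back as not high-confidence would then have a 50 bp window on each side removed from the high-confidence regions, which is a separate interval-subtraction step not shown here.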

Page 32: Giab aug2015 intro and update 150821.pptx

Forming high-confidence calls on AJ Trio

Generate candidate calls with multiple analysis methods from multiple types of data

Compare/integrate candidate calls and manually inspect data to understand differences; refine calls?

Generate integrated calls with several methods (MetaSV, Parliament, svclassify, others?)

Combine integrated calls (with heuristics and/or machine learning) to generate high-confidence calls

https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing

August 30, 2015

Nov 1, 2015

Dec 1, 2015

Jan 26, 2016

Page 33: Giab aug2015 intro and update 150821.pptx

Analysis Progress: AJ Trio

• SNPs/indels
– Several candidate callsets
– NIST working on integration

• Assembly
– 2 de novo assemblies of AJ trio (MHAP and Falcon/Bionano)
– Will be used by at least 2 groups for SV calling

• Structural variants
– Candidate calls being generated by 14+ groups with >14 different algorithms and 6 datasets
– 3 integration methods: MetaSV, Parliament, svclassify

• Long-range Phasing
– 2 phased calls so far (CG LFR and 10X)
– Integration methods needed!