Genome in a Bottle Consortium Progress Update January 27, 2014 Justin Zook, Marc Salit, and the Genome in a Bottle Consortium

140127 GIAB update and NIST high-confidence calls

Page 1: 140127 GIAB update and NIST high-confidence calls

Genome in a Bottle Consortium

Progress Update

January 27, 2014

Justin Zook, Marc Salit, and the Genome in a Bottle Consortium

Page 2: 140127 GIAB update and NIST high-confidence calls

Whole Genome RMs vs. Current Validation Methods

• Sanger confirmation
– Limited by number of sites (and sometimes it’s wrong)

• High depth NGS confirmation
– May have same systematic errors

• Genotyping microarrays
– Limited to known (easier) variants
– Problems with neighboring “complex” variants, duplications

• Mendelian inheritance
– Can’t account for some systematic errors

• Simulated data
– Generally not very representative of errors in real data

• Ti/Tv
– Varies by region of genome, and only gives overall statistic
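As a rough illustration of the Ti/Tv point, the sketch below computes the transition/transversion ratio for a list of SNPs. The function name `ti_tv_ratio` and the `(ref, alt)` tuple input are assumptions made for this example, not part of any GIAB tool. Because the result is a single number per call set, it cannot say which individual calls are wrong, which is the limitation noted on the slide.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv_ratio(snps):
    """Return the transition/transversion ratio for (ref, alt) SNP pairs.

    Hypothetical helper: assumes single-base, biallelic SNPs only.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a single-base SNP: {ref}>{alt}")
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ti / tv


# Three transitions and one transversion.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ti_tv_ratio(example))
```

To see the regional variation mentioned above, the same function could be applied separately to the SNPs falling in each region of interest.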

Page 3: 140127 GIAB update and NIST high-confidence calls

Goals for Data to Accompany RM

• ~0 false positive AND false negative calls in confident regions

• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)

• Avoid bias towards any particular platform
– Take advantage of strengths of each platform

• Avoid bias towards any particular bioinformatics algorithms

Page 4: 140127 GIAB update and NIST high-confidence calls

Integrate 14 Datasets from 5 platforms

Page 5: 140127 GIAB update and NIST high-confidence calls

Integration of Data to Form Highly Confident Genotype Calls

• Candidate variants
– Find all possible variant sites

• Concordant variants
– Find concordant sites across multiple datasets

• Find characteristics of bias
– Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias

• Arbitrate using evidence of bias
– For each site, remove datasets with decreasingly atypical characteristics until all datasets agree

• Confidence Level
– Even if all datasets agree, identify them as uncertain if few have typical characteristics, or if they fall in known segmental duplications, SVs, or long repeats
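The arbitration and confidence steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the NIST pipeline: the `Call` record, the single `atypicality` score per dataset, and the `min_typical` and `typical_cutoff` thresholds are all hypothetical stand-ins for the real evidence of bias.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str
    genotype: str        # e.g. "0/1"
    atypicality: float   # hypothetical score; higher = more evidence of bias


def arbitrate(calls, in_difficult_region=False, min_typical=2, typical_cutoff=1.0):
    """Return (genotype, is_confident) for one site.

    Drops the most atypical dataset first, then the next most atypical,
    until the remaining datasets agree on a genotype.
    """
    remaining = sorted(calls, key=lambda c: c.atypicality)  # most typical first
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # remove the most atypical remaining dataset
    genotype = remaining[0].genotype if remaining else None
    # Even when all remaining datasets agree, flag the site as uncertain if
    # few of them look typical or the site lies in a known difficult region.
    n_typical = sum(c.atypicality < typical_cutoff for c in remaining)
    confident = (
        genotype is not None
        and n_typical >= min_typical
        and not in_difficult_region
    )
    return genotype, confident


site = [
    Call("platform_A", "0/1", 0.1),
    Call("platform_B", "0/1", 0.2),
    Call("platform_C", "1/1", 3.0),  # disagrees and looks biased
]
print(arbitrate(site))
print(arbitrate(site, in_difficult_region=True))
```

A real implementation would use several annotations per dataset (for example strand bias or mapping quality) rather than one score; collapsing them into a single number here only keeps the control flow visible.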

Page 6: 140127 GIAB update and NIST high-confidence calls

Verification of “Highly Confident” Genotype accuracy

• Sanger sequencing
– 100% accuracy but only 100s of sites

• X Prize Fosmid sequencing
– Sometimes call only part of a complex variant

• Microarrays
– Differences appear to be FP or FN in arrays

• Broad 250bp HaplotypeCaller
– Very highly concordant

• Platinum genomes pedigree SNPs
– Some systematic errors are inherited; different representations of complex variants

• Real Time Genomics SNPs and indels
– Some interesting sites called by RTG complex caller

Page 7: 140127 GIAB update and NIST high-confidence calls

GCAT – Interactive Performance Metrics

• NIST is working with GCAT to use our highly confident variant calls

• Assess performance of many combinations of mappers and variant callers

• www.bioplanet.com/gcat

Improvement of FreeBayes over 1 year with indels

Page 8: 140127 GIAB update and NIST high-confidence calls

Why do calls differ from our highly confident genotypes?

Apparent False Positives
• Platform-specific systematic sequencing errors for SNPs
• Analysis-specific
• Difficult to map regions
• Indels in long homopolymers

Apparent False Negatives
• Different complex variant representation
• Near indels
• Inside repeats

Page 9: 140127 GIAB update and NIST high-confidence calls

Complex variants have multiple correct unphased representations

[Figure: alignments from BWA, ssaha2, CGTools, and Novoalign against the reference (Ref); labels: T insertion, TCTCT insertion]

Comparison method | FP SNPs | FP MNPs | FP indels
Traditional comparison | 0.38% (610) | 100% (915) | 6.5% (733)
Comparison with realignment | 0.15% (249) | 4.2% (38) | 2.6% (298)

• ~225,000 highly confident variants are within 10bp of another variant

• FPs and FNs are significantly enriched for complex variants

• RTG vcfeval can fix this issue!
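One way to see why a position-by-position comparison over-counts false positives is to compare the sequences that two call sets produce instead of their records. The sketch below is a simplified, haploid illustration and is not how RTG vcfeval is implemented; `apply_variants` and the toy reference are assumptions made for this example.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    Positions are 1-based, as in VCF. Variants are applied right to left so
    that earlier coordinates are not shifted by indels. Hypothetical helper.
    """
    seq = reference
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        start = pos - 1
        end = start + len(ref_allele)
        if seq[start:end] != ref_allele:
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq = seq[:start] + alt_allele + seq[end:]
    return seq


reference = "GATCTCTG"

# The same CT insertion placed at two different positions in a CT repeat.
left_aligned = [(3, "T", "TCT")]
right_shifted = [(5, "T", "TCT")]

# One MNP versus the equivalent pair of adjacent SNPs.
as_mnp = [(2, "AT", "GC")]
as_snps = [(2, "A", "G"), (3, "T", "C")]

print(apply_variants(reference, left_aligned) == apply_variants(reference, right_shifted))
print(apply_variants(reference, as_mnp) == apply_variants(reference, as_snps))
```

Both comparisons print `True`: the records differ, but the resulting sequences are identical, so neither representation should be scored as an error against the other.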

Page 10: 140127 GIAB update and NIST high-confidence calls

Reasons we exclude regions from high-confidence set

Page 11: 140127 GIAB update and NIST high-confidence calls

Reasons we exclude regions from high-confidence set

Page 12: 140127 GIAB update and NIST high-confidence calls

Structural variant analytical approach

• Depth of coverage (DOC): Control-FREEC, CnD

• Paired-end mapping (PEM): Breakdancer

• Split read (SR): Pindel

• Assembly based (AS): Velvet, ABySS

• Combination: Genome-STRiP

• SVMerge: list of structural variant calls

Page 13: 140127 GIAB update and NIST high-confidence calls
Page 14: 140127 GIAB update and NIST high-confidence calls

Validation parameters for each SV

• Coverage (mean and standard deviation)
• Paired-end distance/insert size (mean and standard deviation)
• # of discordant paired-ends
• Soft clipping of the reads (mean and standard deviation)
• Mapping quality (mean and standard deviation)
• # of heterozygous and homozygous SNP genotype calls
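A minimal sketch of how the validation parameters for one SV might be summarized, assuming the reads overlapping the candidate region have already been extracted. The `Read` record, its fields, and `sv_validation_parameters` are hypothetical names chosen for this example; a real pipeline would read these values from a BAM file.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Read:
    start: int          # 0-based, inclusive
    end: int            # 0-based, exclusive
    insert_size: int    # paired-end distance
    mapq: int           # mapping quality
    soft_clipped: int   # number of soft-clipped bases
    discordant: bool    # True if the pair is discordant


def mean_sd(values):
    """Return (mean, population standard deviation) of a non-empty sequence."""
    return mean(values), pstdev(values)


def sv_validation_parameters(reads, region_start, region_end, genotypes=()):
    """Summarize evidence in [region_start, region_end) for one candidate SV.

    `genotypes` holds SNP genotype strings such as "0/1" called in the region.
    Assumes at least one read and a non-empty region.
    """
    depth = [0] * (region_end - region_start)
    for r in reads:
        for pos in range(max(r.start, region_start), min(r.end, region_end)):
            depth[pos - region_start] += 1
    het = sum(1 for gt in genotypes if len(set(gt.split("/"))) > 1)
    return {
        "coverage": mean_sd(depth),
        "insert_size": mean_sd([r.insert_size for r in reads]),
        "soft_clipping": mean_sd([r.soft_clipped for r in reads]),
        "mapping_quality": mean_sd([r.mapq for r in reads]),
        "discordant_pairs": sum(r.discordant for r in reads),
        "het_snps": het,
        "hom_snps": len(genotypes) - het,
    }


reads = [Read(0, 2, 300, 60, 0, False), Read(1, 4, 500, 20, 10, True)]
print(sv_validation_parameters(reads, 0, 4, genotypes=["0/1", "1/1", "0/1"]))
```

Comparing these summaries inside a candidate SV with those in flanking sequence is one plausible way to use them; the slide does not specify how they are combined.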

Page 15: 140127 GIAB update and NIST high-confidence calls

Challenges with assessing performance

• All variant types are not equal

• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes

• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance

• Genotypes fall in 3+ categories (not positive/negative)
– Standard diagnostic accuracy measures not well posed
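The last two challenges can be made concrete with a three-category genotype comparison. The sketch below is illustrative only: the category names, the `genotype_confusion` and `concordance` helpers, and the toy data are assumptions made for this example. Sites missing from either input are treated as homozygous reference, which is only reasonable inside regions where the truth set is confident.

```python
from collections import Counter


def genotype_confusion(truth, calls, sites=None):
    """Count (true genotype, called genotype) pairs over the given sites.

    `truth` and `calls` map site -> "hom-ref", "het", or "hom-var".
    If `sites` is None, all sites present in either input are compared.
    """
    if sites is None:
        sites = truth.keys() | calls.keys()
    matrix = Counter()
    for site in sites:
        pair = (truth.get(site, "hom-ref"), calls.get(site, "hom-ref"))
        matrix[pair] += 1
    return matrix


def concordance(matrix):
    """Fraction of compared sites where the called genotype matches the truth."""
    total = sum(matrix.values())
    agree = sum(n for (true_gt, called_gt), n in matrix.items() if true_gt == called_gt)
    return agree / total


truth = {1: "het", 2: "hom-var", 3: "het", 4: "hom-ref"}
calls = {1: "het", 2: "het", 3: "het", 4: "het"}

all_sites = genotype_confusion(truth, calls)
# Suppose sites 2 and 4 are labeled uncertain and excluded from assessment.
confident_only = genotype_confusion(truth, calls, sites=[1, 3])

print(concordance(all_sites))
print(concordance(confident_only))
```

The hom-var site called as het is neither a clean false positive nor a clean false negative, and excluding the two difficult sites moves the apparent concordance from 0.5 to 1.0 without any change in the caller.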

Page 16: 140127 GIAB update and NIST high-confidence calls

Pedigree calls

• RTG and Illumina Platinum Genomes working on this

• Sequence NA12878, husband, and 11 children to identify high confidence variants
– Identify cross-over events
– Determine if genotypes are consistent with inheritance

• Should we integrate these with the NIST high-confidence genotypes?

• Should we find larger families for future genomes?

• See afternoon presentations!

Source: Mike Eberle, Illumina
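For a single site in a single trio, "consistent with inheritance" can be sketched as below. This is a simplified, hypothetical check for diploid autosomal genotypes written as VCF-style strings; it is not the RTG or Platinum Genomes method, which works with the full 11-child pedigree and cross-over events.

```python
from itertools import product


def alleles(genotype):
    """Split a VCF-style genotype such as "0/1" or "0|1" into its alleles."""
    return genotype.replace("|", "/").split("/")


def is_mendelian_consistent(mother, father, child):
    """True if the child could inherit one allele from each parent."""
    child_alleles = sorted(alleles(child))
    return any(
        sorted((from_mother, from_father)) == child_alleles
        for from_mother, from_father in product(alleles(mother), alleles(father))
    )


print(is_mendelian_consistent("0/1", "0/0", "0/1"))
print(is_mendelian_consistent("0/0", "0/0", "0/1"))
```

A check like this passes whenever every family member shares the same systematic error, for example if all three are called heterozygous at a site affected by a paralogous sequence, so it cannot catch errors that are themselves inherited.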

Page 17: 140127 GIAB update and NIST high-confidence calls

Pedigree Calls in Uncertain Regions

Page 18: 140127 GIAB update and NIST high-confidence calls

GIAB Characterization of pilot RM

• NIST – 300x 150x150bp HiSeq (from 6 vials)
• NIST – 100x 75bp ECC SOLiD 5500W
• Illumina – 50x 100x100bp HiSeq
• Complete Genomics – Normal and LFR (non-RM)
• Garvan Institute – Illumina exome
• NCI – Ion Proton whole genome
• INOVA – Infinium SNP/CNV array

Page 19: 140127 GIAB update and NIST high-confidence calls

Homogeneity and Stability

Homogeneity
• Multiplex First and last vial
– 3 libraries x 33x HiSeq each
• Multiplex 4 Random vials
– 2 libraries x 12.5x HiSeq each
• Compare variability due to:
– vial
– library
– day
– flow cell
– lane
– sampling
• Run PFGE on each vial for size

Stability
• Run PFGE to detect DNA degradation
• Freeze-thaw 2 and 5 times
• Vortex for 10s
• 4°C for 2 and 8 weeks
• 37°C for 2 and 8 weeks

Page 20: 140127 GIAB update and NIST high-confidence calls

FTP site and Amazon S3

• NCBI is hosting fastq, bam, and vcf files on the giab ftp site

• These data are mirrored to Amazon S3, so we encourage you to take advantage of this!

Page 21: 140127 GIAB update and NIST high-confidence calls

Pilot Reference Material

• High-confidence calls are available on the ftp site and are already being used

• NIST plans to release this as a NIST Reference Material in the next couple months

Page 22: 140127 GIAB update and NIST high-confidence calls

Future Directions

• Characterize more “difficult” regions/variants
• Structural variants
• Compare to pedigree calls
• Examine potentially clinically relevant regions/variants in RMs
• Use long-read technologies
– Moleculo
– CG LFR
– PacBio
– BioNano Genomics
– future technologies??

• Use glia/platypus to realign reads to candidate variants

• Analyze interlaboratory study data

• Characterize PGP genomes
– Ashkenazim trio
– son in Asian trio
– DNA at NIST in Jan-Feb 2014
– Volunteers to sequence?

• Select future genomes
• Tumor-normal?

Page 23: 140127 GIAB update and NIST high-confidence calls

Topic #1: Moving beyond the easy regions/variants

Presentations
• Emerging Technologies
– PacBio
– Complete Genomics LFR
– Moleculo
– BioNano Genomics
• Structural Variants
– Bina Technologies

Topics
• Structural Variants
• Phasing
• Validation
• Where should we set the threshold(s) for confidence?

Page 24: 140127 GIAB update and NIST high-confidence calls

Topic #2: Cancer and Future Genomes

Cancer
• Spike-ins
• Mixtures of normal cell lines
• Tumor-normal cell line pair
• Transcriptome controls

Priorities for Future Genomes
• Diverse ancestry groups
• Larger families
• Recruitment with consent for commercialization
• How many genomes?
• Should the parents be NIST Reference Materials, or only the child?

Page 25: 140127 GIAB update and NIST high-confidence calls

Working Group Questions

RM Selection & Design
• Spike-in controls
• FFPE
• Commercial RMs
• ABRF interlaboratory study
• Should we prioritize one or two genomes?

RM Characterization
• Production mode for new trios
– Pilot was characterized by Illumina, SOLiD, Ion Proton, and Complete Genomics
– What resources should we invest in measurements for each new family?

Page 26: 140127 GIAB update and NIST high-confidence calls

Working Group Questions

Bioinformatics
• Storing data/pipelines
– Suggestions for ftp structure
– Data submission/accessioning process
– Data model for genomic data
– Archiving pipelines and reproducible research
• GRCh38
• How to use pedigree calls for pilot genome?
• Clones for targeted regions (hard regions if not whole genome)
• In which difficult regions should we focus our characterization?

Performance Metrics
• Target audience
• Requirements for user interface
– Establishing truth set(s)
– Inputs/Outputs
– Visualization
• Integration with GeT-RM