Development & applications of a segregation-phasing ground truth Francisco M. De La Vega, D.Sc. Visiting Scholar, Department of Genetics Stanford University School of Medicine In collaboration with Real Time Genomics, Inc. GENOME-IN-A-BOTTLE WORKSHOP

140127 rtg phased pedigree analyses


Page 1: 140127 rtg phased pedigree analyses

Development & applications of a segregation-phasing ground truth

Francisco M. De La Vega, D.Sc.
Visiting Scholar, Department of Genetics
Stanford University School of Medicine

In collaboration with Real Time Genomics, Inc.

GENOME-IN-A-BOTTLE WORKSHOP

Page 2: 140127 rtg phased pedigree analyses

Evaluating Variant Calls

O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine 5, 28 (2013).

Page 3: 140127 rtg phased pedigree analyses

Beyond Venn Diagrams

• Experimental validation (e.g. Sanger, qPCR)
  • Expensive
  • Limited by platform success
  • Statistical sample
• Reference orthogonal data available for some genomes
  • SNP array data
  • Sparse fosmid sequencing data
  • Incomplete
• Reference genomes sequenced by multiple platforms
  • Arbitration methods (e.g. NIST, Genome-in-a-Bottle)
  • Low FP, but unknown FN (genome-wide)
  • Biases?

Page 4: 140127 rtg phased pedigree analyses

Mendelian segregation as “ground truth”

Page 5: 140127 rtg phased pedigree analyses

CEPH/Utah Pedigree 1463

• Sequenced by CGI and Illumina (Platinum Genomes)
• Started with 2x100 bp 50X WGS Illumina Platinum data
• Aligned and variant-called with rtgVariant 1.1, filtered by quality score (AVR ≥ 0.15) across the samples, excluding problematic sites

Page 6: 140127 rtg phased pedigree analyses

Example: Heterozygous variant segregation

Page 7: 140127 rtg phased pedigree analyses

Segregation of heterozygous variants to offspring

[Figure: four histograms of variant count (y-axis) against number of offspring segregating (x-axis, 1 to 11). Panels: SNV, MNP, indel, All Variants.]
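The quantity on the x-axis can be sketched in a few lines of Python. This is only an illustration, not the analysis code behind the slides: it assumes a variant that is heterozygous in one parent and absent in the other, takes the family size of 11 offspring from the axis range, and uses hypothetical function names. Under Mendelian segregation each child inherits the allele with probability 1/2, so the count of carriers is expected to follow a Binomial(11, 1/2) distribution.

```python
from math import comb


def offspring_segregating(child_genotypes, alt=1):
    """Count offspring that carry the alt allele (genotypes are allele pairs)."""
    return sum(1 for genotype in child_genotypes if alt in genotype)


def expected_fraction(k, n=11):
    """Binomial(n, 1/2) probability that exactly k of n offspring carry the allele.

    Assumes one heterozygous parent and one homozygous-reference parent.
    """
    return comb(n, k) / 2 ** n


# Hypothetical family: three of five children inherited the alt allele.
children = [(0, 1), (0, 0), (0, 1), (0, 1), (0, 0)]
carriers = offspring_segregating(children)
```

A variant whose carrier count falls far outside this expectation across many sites is one signal that a call may be an artifact rather than a segregating allele.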

Page 8: 140127 rtg phased pedigree analyses

Steps for haplotype phasing in large family

Check calls vs haplotype framework

Connect haplotype islands

Phase contiguity extension

Identify crossovers

Page 9: 140127 rtg phased pedigree analyses

Phasing labels given parent and child genotypes

Father (fa/fb)  Mother (ma/mb)  Child genotype: phasing label(s)
0/0             0/1             0/0: fa/ma, fb/ma;  0/1: fa/mb, fb/mb
0/1             0/1             0/0: fa/ma;  0/1: fb/ma, fa/mb;  1/1: fb/mb
0/0             1/2             0/1: fa/ma, fb/ma;  0/2: fa/mb, fb/mb
0/1             1/2             0/1: fa/ma;  0/2: fa/mb;  1/1: fb/ma;  1/2: fb/mb
0/1             2/3             0/2: fa/ma;  0/3: fa/mb;  1/2: fb/ma;  1/3: fb/mb
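The label table can be generated mechanically. The following is a minimal Python sketch, assuming the father's alleles are given in the order (fa, fb) and the mother's in the order (ma, mb); the function name `phasing_labels` is hypothetical and is not part of rtgVariant or rtgTools.

```python
def phasing_labels(father, mother):
    """Map each possible child genotype to the phasing labels that produce it.

    father is (fa, fb) and mother is (ma, mb). Genotypes are unordered,
    so each child genotype is stored as a sorted tuple.
    """
    labels = {}
    for f_name, f_allele in (("fa", father[0]), ("fb", father[1])):
        for m_name, m_allele in (("ma", mother[0]), ("mb", mother[1])):
            child = tuple(sorted((f_allele, m_allele)))
            labels.setdefault(child, []).append(f"{f_name}/{m_name}")
    return labels


# Both parents heterozygous: the 0/1 child is ambiguous between two labels.
het_by_het = phasing_labels((0, 1), (0, 1))
# Four distinct parental alleles: every child genotype has a unique label.
all_distinct = phasing_labels((0, 1), (2, 3))
```

When all four parental alleles differ, each child genotype identifies exactly one label, which is why such sites are the most informative for phasing.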

Page 10: 140127 rtg phased pedigree analyses

Identification of recombination crossovers

Chr 1, Mother

Chr 6, Mother

Page 11: 140127 rtg phased pedigree analyses

Recombination crossovers statistics

[Figure: bar chart of the number of recombination crossovers (y-axis, 0 to 45) per chromosome (x-axis, 1 to 22), with separate bars for Father and Mother.]

Total: 686

Page 12: 140127 rtg phased pedigree analyses

Linking of phased regions

Chr 1, Mother

Chr 6, Mother

Page 13: 140127 rtg phased pedigree analyses

Testing for Phase Consistency

Example with 4 offspring

                Father  Mother  Offspring 1  Offspring 2  Offspring 3  Offspring 4
Phasing labels  fa fb   ma mb   fa ma        fa mb        fb ma        fb mb

Genotypes       0/1     0/1     0/1          0/0          1/1          0/1
Phasings        0 1     0 1     0 0          0 1          1 0          1 1
                0 1     1 0     0 1          0 0          1 1          1 0
                1 0     0 1     1 0          1 1          0 0          0 1
                1 0     1 0     1 1          1 0          0 1          0 0

Genotypes       0/0     0/1     0/0          0/1          0/0          0/1
Phasings        0 0     0 1     0 0          0 1          0 0          0 1
                0 0     1 0     0 1          0 0          0 1          0 0
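The test in this example can be sketched in Python: enumerate every way of ordering the parental alleles, derive what each offspring's genotype would be under its phasing label, and keep the orderings that reproduce all observed genotypes. This is an illustrative sketch under those assumptions, not RTG's implementation, and the function names are hypothetical.

```python
from itertools import permutations


def parental_phasings(genotype):
    """Distinct ordered (a, b) assignments of an unordered genotype."""
    return sorted(set(permutations(genotype)))


def consistent_phasings(father, mother, offspring, labels):
    """Return parental phasings under which every offspring matches its label."""
    hits = []
    for fa, fb in parental_phasings(father):
        for ma, mb in parental_phasings(mother):
            alleles = {"fa": fa, "fb": fb, "ma": ma, "mb": mb}
            if all(
                sorted((alleles[pat], alleles[mat])) == sorted(child)
                for child, (pat, mat) in zip(offspring, labels)
            ):
                hits.append(((fa, fb), (ma, mb)))
    return hits


# Phasing labels of the four offspring, as on the slide.
labels = [("fa", "ma"), ("fa", "mb"), ("fb", "ma"), ("fb", "mb")]

# First example: parents 0/1 and 0/1; offspring 0/1, 0/0, 1/1, 0/1.
first = consistent_phasings((0, 1), (0, 1), [(0, 1), (0, 0), (1, 1), (0, 1)], labels)
# Second example: parents 0/0 and 0/1; offspring 0/0, 0/1, 0/0, 0/1.
second = consistent_phasings((0, 0), (0, 1), [(0, 0), (0, 1), (0, 0), (0, 1)], labels)
```

A site is phase-consistent when at least one parental phasing survives. In both slide examples exactly one does: the second row of candidate phasings in the first example and the first row in the second.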

Page 14: 140127 rtg phased pedigree analyses

Probability of a set of genotypes being phase-consistent by chance

Given that there are d different genotypes across both the parents and children, and that the number of times each of these genotypes occurs is n_i, the probability is:

[Equation not recoverable from the slide text.]

Cleary, J. G., et al. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data. bioRxiv (2014). doi:10.1101/001958

Page 15: 140127 rtg phased pedigree analyses

Probability of a set of genotypes being phase-consistent by chance – some examples

Genotype counts                  Probability
0/0   0/1   1/1   0/2   1/2
-     -     13    -     -        1
-     13    -     -     -        3.01 x 10^-1
6     7     -     -     -        1.01 x 10^-2
1     12    -     -     -        1.11 x 10^-1
1     11    1     -     -        1.36 x 10^-2
4     4     5     -     -        5.53 x 10^-4
-     3     3     3     4        6.13 x 10^-5
-     1     3     3     12       3.68 x 10^-1
1     5     -     6     1        2.75 x 10^-4
1     11    -     13    1        7.46 x 10^-2

Page 16: 140127 rtg phased pedigree analyses

Phasing consistent variants

Call set                      Raw                  AVR > 0.15
                              n           %        n           %
Phase consistent              5,224,138   77.35    4,606,574   99.28
Phase inconsistent            1,329,189   19.68    13,951      0.30
Repaired                      200,450     2.96     19,197      0.41
Calls inside phased segments  6,753,777   99.99    4,639,722   99.99

Illumina 2x100 bp 50X WGS Data, RTG Trio Calls

Y-chromosome excluded

Page 17: 140127 rtg phased pedigree analyses

Phasing consistent variants

Call set                      Raw                   VQSR 1st tranche
                              n            %        n           %
Phase consistent              6,941,213    68.34    5,863,035   96.00
Phase inconsistent            2,263,975    22.29    184,169     3.01
Repaired                      951,682      9.36     59,592      0.97
Calls inside phased segments  10,156,870   99.53    6,106,796   99.98

Illumina 2x100 bp 50X WGS Data, BWA/GATK UG v1.7 Calls

Y-chromosome excluded

Page 18: 140127 rtg phased pedigree analyses

ROC curve: NA12878 vs Phased-Consistent

RTG sorted by AVR; GATK sorted by VQSLOD (1st tranche)
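A ROC curve of this kind is built by ranking calls from highest to lowest score and accumulating true and false positives against the truth set as the threshold drops. The Python sketch below illustrates the idea only; it is not the vcfeval implementation, and the input format (a score paired with a flag saying whether the call is in the phase-consistent set) is an assumption made for the example.

```python
def roc_points(calls):
    """Return cumulative (false positives, true positives) as the score threshold lowers.

    calls is an iterable of (score, in_truth_set) pairs, where score could be
    AVR or VQSLOD and in_truth_set says whether the call is phase-consistent.
    """
    points = []
    true_pos = false_pos = 0
    for _score, in_truth_set in sorted(calls, key=lambda c: c[0], reverse=True):
        if in_truth_set:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos, true_pos))
    return points


# Hypothetical scored calls: two supported by the truth set, one not.
curve = roc_points([(0.9, True), (0.2, False), (0.5, True)])
```

Plotting the second coordinate against the first gives one curve per caller, so callers with different score scales can be compared on the same axes.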

Page 19: 140127 rtg phased pedigree analyses

NIST GiaB arbitration vs Phase-Consistent

Confident regions

Genome-wide

Page 20: 140127 rtg phased pedigree analyses

Assessment of score recalibration models

rtgVariant v 1.1; NA12878

Page 21: 140127 rtg phased pedigree analyses


Assessment of MNP & indel calling (rtgVariant 1.0)

• In rtgVariant 1.0, longer insertions have a higher FP rate than small insertions and deletions.

• More FPs in MNPs

• Improvements in aligner for v1.2

[Figure: percentage of phase-inconsistent calls for deletions, insertions, and SNV/MNPs, with 0.5% marked on the axis. rtgVariant v 1.0; NA12878.]

Page 22: 140127 rtg phased pedigree analyses

Summary & Perspectives

• Genetic segregation in a large family offers a unique opportunity to identify “true” sets of variants

• Requires collecting data for whole family as new chemistries and platforms become available (e.g. 2x250bp, Moleculo reads)

• Data from multiple platforms can be merged to create a comprehensive phase-consistent ground truth

• Allows rational assessment of variant pipelines and improvement of algorithms

• Some issues that need to be dealt with: cell line artifacts, CNVs, systematic errors, SVs.

Page 23: 140127 rtg phased pedigree analyses

rtgTools v1.0

A toolkit to compare and analyze VCFs

• vcfeval – comparison of VCFs for ROC curves
• rocplot – draw ROC curves from vcfeval output
• mendelian – counts of Mendelian inheritance errors in pedigrees
• vcfstats – basic statistics of VCF files
• vcffilter – filtering of VCFs by scores, etc.
• vcfannotate – annotation of VCF files
• vcfmerge – merge VCF files

Java compiled code freely available at GiaB repository:

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/

Page 24: 140127 rtg phased pedigree analyses

http://biorxiv.org/content/early/2014/01/24/001958

Page 25: 140127 rtg phased pedigree analyses

Acknowledgements

RTG, Hamilton, New Zealand: John Cleary, Ross Braithwaite, Len Trigg
RTG, San Bruno, CA: Sahar Malakshah, Minita Shah
Michael Eberle, Illumina, Inc. – Platinum Project data
Complete Genomics, Inc. – CEPH pedigree data
Justin Zook – NIST

Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab)

This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA.

© 2014 Real Time Genomics, Inc. All rights reserved.