140127 rtg vcfeval vcf comparison tool


Comparing Variant Calls

Francisco M. De La Vega, D.Sc.
Visiting Scholar, Department of Genetics
Stanford University School of Medicine

In collaboration with Real Time Genomics, Inc.

Genome-in-a-Bottle Workshop

rtgTools v1.0

A toolkit to compare and analyze VCFs

• vcfeval – comparison of VCFs for ROC curves
• rocplot – draw ROC curves from vcfeval output
• mendelian – counts of Mendelian inheritance errors in pedigrees
• vcfstats – basic statistics of VCF files
• vcffilter – filtering of VCFs by scores, etc.
• vcfannotate – annotation of VCF files
• vcfmerge – merge VCF files

Java compiled code freely available at GiaB repository:

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/


Issues in representation of complex calls

Indel in homopolymer

Reference CAAAAAAG

Baseline C..AAAAG

Called CAAAA..G

After replay:

Baseline CAAAAG

Called CAAAAG

MNPs

Reference CAACGTAAG

Baseline CAATGTCAG

Called CAATGTCAG
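The "replay" idea above can be sketched in a few lines of Python: apply each variant set to the reference and compare the resulting haplotypes instead of comparing the records themselves. This is a minimal illustration, not the vcfeval implementation. The `replay` helper and the 0-based `(pos, ref_allele, alt_allele)` tuples are hypothetical, and the choice of one MNP record for the baseline versus two SNPs for the called set is an assumption made for the example.

```python
def replay(reference, variants):
    """Apply non-overlapping (pos, ref_allele, alt_allele) edits to a reference.

    Positions are 0-based and each ref_allele must match the reference at pos.
    """
    out = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if reference[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError("ref allele does not match the reference")
        out.append(reference[cursor:pos])
        out.append(alt_allele)
        cursor = pos + len(ref_allele)
    out.append(reference[cursor:])
    return "".join(out)


# Homopolymer indel: the same 2-base deletion placed at either end of the A run.
homopolymer_ref = "CAAAAAAG"
baseline_indel = [(0, "CAA", "C")]  # C..AAAAG
called_indel = [(4, "AAA", "A")]    # CAAAA..G

# MNP: one multi-base record versus two separate SNPs (assumed representations).
mnp_ref = "CAACGTAAG"
baseline_mnp = [(3, "CGTA", "TGTC")]
called_mnp = [(3, "C", "T"), (6, "A", "C")]

print(replay(homopolymer_ref, baseline_indel), replay(homopolymer_ref, called_indel))
print(replay(mnp_ref, baseline_mnp), replay(mnp_ref, called_mnp))
```

Both pairs replay to identical haplotypes even though no record in one set is textually equal to a record in the other.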

Dinucleotide repeat

Reference ACGTACCAGATATCACAACATATATATA

Baseline ACGGACCAG..ATCACAACATATATATATA

Called ACGGACCAGAT..CACAACATATATATATA

After replay:

Baseline ACGGACCAGATCACAACATATATATATA

Called ACGGACCAGATCACAACATATATATATA

Best path → Link mutations → ROC

Comparison of variant call set with baseline set

Basic rules
• Match the baseline and called sequences so as to maximize true positives and minimize false positives and false negatives.
• True positives + false negatives = total calls in the baseline.
• Heterozygous calls match only when both the zygosity and the alleles agree.

Path creation
• A path is a selection of a subset of calls.
• Best path: a path that maximizes true positives and minimizes errors.
• In theory there is an exponential number of paths; in practice the best path can be found by dynamic programming.
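To make the notion of a path concrete, the sketch below enumerates every subset of baseline calls and every subset of called variants, keeps the pairs whose replayed haplotypes agree, and returns the pair that includes the most calls. It is deliberately the exponential brute-force search, not the dynamic-programming approach named above, and it handles a single haplotype only. The helper names and the `(pos, ref_allele, alt_allele)` tuples are hypothetical.

```python
from itertools import combinations


def replay(reference, variants):
    """Apply (pos, ref_allele, alt_allele) edits; return None if they conflict."""
    out, cursor = [], 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor or reference[pos:pos + len(ref_allele)] != ref_allele:
            return None
        out.append(reference[cursor:pos] + alt_allele)
        cursor = pos + len(ref_allele)
    return "".join(out) + reference[cursor:]


def subsets(items):
    """Yield every subset of items as a tuple, smallest first."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def best_path(reference, baseline, called):
    """Exhaustively pick the largest subset pair with matching haplotypes."""
    best = ((), ())
    for b_subset in subsets(baseline):
        b_hap = replay(reference, b_subset)
        if b_hap is None:
            continue
        for c_subset in subsets(called):
            if replay(reference, c_subset) != b_hap:
                continue
            if len(b_subset) + len(c_subset) > len(best[0]) + len(best[1]):
                best = (b_subset, c_subset)
    return best


reference = "CAAAAAAG"
baseline = [(0, "CAA", "C")]
called = [(4, "AAA", "A"), (7, "G", "T")]  # the SNP has no baseline counterpart

kept_baseline, kept_called = best_path(reference, baseline, called)
false_positives = [v for v in called if v not in kept_called]
false_negatives = [v for v in baseline if v not in kept_baseline]
print(kept_baseline, kept_called, false_positives, false_negatives)
```

Maximizing the number of included calls is the same as minimizing the number of excluded ones, so a single score covers both halves of the rule. The cost grows as 2^(baseline + called), which is why a practical tool needs something better than enumeration.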

Path creation - simple homozygous case

[Figure: baseline and called variants a-h laid out along the reference. The best path excludes one false positive and one false negative.]

Path creation - simple heterozygous case (non-phased)

[Figure: baseline and called variants a-f laid out along the reference. The best path excludes one false positive and one false negative.]

Why is weighting needed?

TP + FN = Total_baseline

Reference CAACAACTATCCTC....ATCT....GC

Baseline CAACAACTATCCTCATCTATCTATCTGC

 

Called CAACAACTATCCTCATCTATCTATCTGC

Sync points

Reference ACAGTCACGG
Baseline  ACGGTCACTG
Called    ACGGTTACGG

Reference AC AGT CAC GG
Baseline  AC GGT CAC TG
Called    AC GGT TAC GG
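The segmentation shown above can be reproduced with a toy rule: for substitution-only, equal-length sequences, place a sync point immediately before each position where either call set differs from the reference. That rule is an assumption chosen to match this one example; the sync-point definition used by vcfeval is more general (it has to cope with indels and overlapping calls). The function name is hypothetical.

```python
def split_at_sync_points(reference, baseline, called):
    """Split three equal-length sequences before each variant position.

    Returns (ref_seg, baseline_seg, called_seg, B, C) per segment, where B and C
    count the baseline and called substitutions inside that segment.
    """
    if not (len(reference) == len(baseline) == len(called)):
        raise ValueError("toy example handles substitutions only")
    starts = [0]
    for i in range(1, len(reference)):
        if baseline[i] != reference[i] or called[i] != reference[i]:
            starts.append(i)
    bounds = starts + [len(reference)]
    segments = []
    for lo, hi in zip(bounds, bounds[1:]):
        b = sum(baseline[i] != reference[i] for i in range(lo, hi))
        c = sum(called[i] != reference[i] for i in range(lo, hi))
        segments.append((reference[lo:hi], baseline[lo:hi], called[lo:hi], b, c))
    return segments


segments = split_at_sync_points("ACAGTCACGG", "ACGGTCACTG", "ACGGTTACGG")
for segment in segments:
    print(*segment)
```

The per-segment counts are the quantities a weighting scheme would work from: a segment with B = 1 and C = 1 is a match, B = 0 and C = 1 is a false positive, and B = 1 and C = 0 is a false negative.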

Weighting

where B is the number of baseline variants between the current sync point (S_n) and the previous sync point (S_{n-1}), and C is the number of called variants between the same two sync points.
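One reading consistent with these definitions, with the constraint TP + FN = total calls in the baseline, and with the 0.5 weights in the complex weighting example is that each matched called variant in a sync interval carries weight B/C. The sketch below assumes that reading; the function name and the list of `(B, C)` pairs are hypothetical.

```python
from fractions import Fraction


def called_weights(intervals):
    """Return one weight per matched called variant, assuming weight = B / C.

    intervals is a list of (B, C) pairs, one per sync interval on the best path.
    """
    weights = []
    for b, c in intervals:
        if c == 0:
            continue  # no called variants in this interval, nothing to weight
        weights.extend([Fraction(b, c)] * c)
    return weights


# Four one-to-one matches, then one baseline variant called as two records.
intervals = [(1, 1), (1, 1), (1, 1), (1, 1), (1, 2)]
weights = called_weights(intervals)
print([float(w) for w in weights], float(sum(weights)))
```

Under this assumption the weighted true-positive total equals the number of matched baseline variants (5 here), so a caller that splits one baseline event into several records is not credited with extra true positives.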

Simple homozygous weighting

[Figure: baseline and called variants between sync points. Weights shown: 1, 1, 1, 1, 1, 1 for the matched variants; 1 and 1 for the excluded false positive and false negative.]

Type  Weighted total
TP    6
FP    1
FN    1

Simple heterozygous case (non-phased) weighting

[Figure: baseline and called variants a-f with a sync point marked. Weights shown: 1, 1, 1, 1 for the matched variants; 1 and 2 for the excluded false positive and false negative.]

Type  Weighted total
TP    4
FP    1
FN    2

Complex weighting

[Figure: baseline and called variants a-f with a sync point marked. Weights shown: 1, 1, 1, 1, 0.5, 0.5.]

Type  Weighted total
TP    5
FP    0
FN    0

ROC Plot

http://biorxiv.org/content/early/2014/01/24/001958
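As a rough illustration of how an ROC curve can be built from weighted calls, the sketch below sorts calls by score from highest to lowest and accumulates weighted true positives and false positives at each threshold. The input tuples and scores are invented for the example and do not reflect the vcfeval output format; ties in score are not merged into a single point.

```python
def roc_points(calls):
    """Return (score, cumulative_fp, cumulative_tp) for each call, best score first.

    calls is an iterable of (score, is_true_positive, weight).
    """
    points = []
    tp = fp = 0.0
    for score, is_tp, weight in sorted(calls, key=lambda c: c[0], reverse=True):
        if is_tp:
            tp += weight
        else:
            fp += weight
        points.append((score, fp, tp))
    return points


# Hypothetical scored calls: (score, is_true_positive, weight).
calls = [(50, True, 1.0), (45, True, 0.5), (30, False, 1.0), (20, True, 0.5)]
points = roc_points(calls)
for point in points:
    print(point)
```

Plotting cumulative TP against cumulative FP gives a curve in absolute counts; dividing the TP axis by the baseline total would turn it into sensitivity.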

Acknowledgements

RTG, Hamilton, New Zealand
John Cleary
Len Trigg
Mehul Rathoud

Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab)

This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA.

© 2014 Real Time Genomics, Inc. All rights reserved.