Variation reference graphs and the variation graph toolkit vg
Erik Garrison, Jouni Sirén, Eric Dawson, Richard Durbin (Wellcome Trust Sanger Institute); Adam Novak, Benedict Paten et al. (UCSC); and many others


Page 1: Variation reference graphs and the variation graph toolkit vg

Variation reference graphs and the variation graph toolkit vg

Erik Garrison, Jouni Sirén, Eric Dawson, Richard Durbin
Wellcome Trust Sanger Institute

Adam Novak, Benedict Paten et al., UCSC

and many others

Page 2: Variation reference graphs and the variation graph toolkit vg

Variation Reference

• Go beyond a linear reference
  – Why a (quasi-)linear reference and a catalog of variants which we keep finding again?

• Local variation: graph reference
  – Map to a structure including known variation
  – >99% variants per person already seen

• Long range variation: haplotype structure
  – Exploit variation sharing
  – Support phasing
  – Recombination rate ~ mutation rate
  – >99% recombination breakpoints per person seen

Page 4: Variation reference graphs and the variation graph toolkit vg

Variation graphs: “Pan Genome”

A variation graph represents many genomes in one non-redundant structure.

Nodes contain sequence, and edges between the ends of nodes represent potential links between successive sequences.
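A minimal sketch of the structure described on this slide, assuming a toy in-memory representation (the class and method names are illustrative and are not vg's API). It shows two haplotypes that differ at one SNP sharing their flanking nodes, so the common sequence is stored only once.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph: nodes carry sequence, edges link node ends."""
    nodes: dict = field(default_factory=dict)   # node id -> DNA sequence
    edges: set = field(default_factory=set)     # (from id, to id) pairs

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, src, dst):
        self.edges.add((src, dst))

    def spell(self, walk):
        """Return the sequence of a walk, checking each step uses an edge."""
        for src, dst in zip(walk, walk[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"no edge {src}->{dst}")
        return "".join(self.nodes[n] for n in walk)


# An A/G SNP flanked by shared sequence: two genomes, one structure.
g = VariationGraph()
g.add_node(1, "ACGT")
g.add_node(2, "A")
g.add_node(3, "G")
g.add_node(4, "TTCA")
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    g.add_edge(src, dst)

ref_haplotype = g.spell([1, 2, 4])   # "ACGTATTCA"
alt_haplotype = g.spell([1, 3, 4])   # "ACGTGTTCA"
```

This sketch ignores strands; edges here are simple directed pairs.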

Page 5: Variation reference graphs and the variation graph toolkit vg

Variation graphs and train tracks

The links in a variation graph are bidirectional. They behave in many ways like train tracks. Nodes have positive and negative strands, allowing them to be traversed in either direction, and can be connected to form loops (repeats), inversions and translocations.

NB: There are other ways to do this. One can put sequence on edges, or use unidirectional graphs that are (nearly) twice as big.
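The strand behaviour can be sketched as follows, assuming each step of a walk is a hypothetical `(node_id, is_forward)` pair and each edge joins two such oriented steps. None of these names come from vg. Visiting a node on its negative strand spells its reverse complement, which is how an inversion is expressed without duplicating sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def flip(walk):
    """The same walk read from the other end: reverse order, flip strands."""
    return [(node_id, not forward) for node_id, forward in reversed(walk)]


def spell(nodes, edges, walk):
    """Spell an oriented walk; each step is (node_id, is_forward)."""
    for a, b in zip(walk, walk[1:]):
        # An edge may be stored in either of its two equivalent readings.
        if (a, b) not in edges and tuple(flip([a, b])) not in edges:
            raise ValueError(f"no edge between {a} and {b}")
    return "".join(
        nodes[n] if forward else reverse_complement(nodes[n])
        for n, forward in walk
    )


nodes = {1: "ACG", 2: "TTA", 3: "GC"}
edges = {
    ((1, True), (2, True)), ((2, True), (3, True)),    # plain path
    ((1, True), (2, False)), ((2, False), (3, True)),  # node 2 inverted
}

forward_walk = [(1, True), (2, True), (3, True)]
inverted_walk = [(1, True), (2, False), (3, True)]

forward_seq = spell(nodes, edges, forward_walk)             # "ACGTTAGC"
inverted_seq = spell(nodes, edges, inverted_walk)           # "ACGTAAGC"
opposite_strand = spell(nodes, edges, flip(forward_walk))   # "GCTAACGT"
```

Walking the same path from the other end gives the reverse complement of the forward sequence, so a single stored edge serves both directions of travel.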

Page 6: Variation reference graphs and the variation graph toolkit vg

“Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016), in press.

Essential operations on pan-genomes

Page 7: Variation reference graphs and the variation graph toolkit vg

github.com/vgteam/vg

Operations implemented in vg

Page 8: Variation reference graphs and the variation graph toolkit vg

Implementation in vg

• Nodes with sequence, Edges, Paths, Mappings

• Alignment tools and .gam format

• Serialisation to disk via protobuf, succinct representation xg, graph building/editing, extraction, unrolling and DAGification of local graphs etc.

https://github.com/vgteam/vg

Page 9: Variation reference graphs and the variation graph toolkit vg

k-mer based alignment of short reads to a variation graph

[Figure: alignment pipeline. Labels: read, k-mers, find k-mer subgraphs, node ids, hit clusters, cluster ids, target subgraph, partial order alignment, Alignment. Example read: AGCTCTCCTTGTCCCTCCTACGATCTCTTCACTGGCCTCTTATCTTTACTGTTACCAAATCTTTCCGGAAGCTGCTCTTTC]

Store results in Graph Alignment Map (GAM) format.
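The seeding stages named on this slide (read, k-mers, node ids, hit clusters) can be sketched roughly as below. This is a simplified illustration, not vg's implementation. It assumes k-mers are indexed within single nodes only and that nearby node ids are nearby in the graph; the function names and the `max_gap` parameter are hypothetical. The final partial order alignment step is not shown.

```python
from collections import defaultdict


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_index(node_seqs, k):
    """Map each k-mer to the ids of nodes containing it (within-node only)."""
    index = defaultdict(set)
    for node_id, seq in node_seqs.items():
        for kmer in kmers(seq, k):
            index[kmer].add(node_id)
    return index


def seed_hits(read, index, k):
    """Count, per node, how many of the read's k-mers hit that node."""
    hits = defaultdict(int)
    for kmer in kmers(read, k):
        for node_id in index.get(kmer, ()):
            hits[node_id] += 1
    return dict(hits)


def cluster(hits, max_gap=2):
    """Group hit node ids into runs whose ids differ by at most max_gap."""
    clusters = []
    for node_id in sorted(hits):
        if clusters and node_id - clusters[-1][-1] <= max_gap:
            clusters[-1].append(node_id)
        else:
            clusters.append([node_id])
    return clusters


node_seqs = {1: "ACGTAC", 2: "GGA", 3: "TTCAGG", 10: "CCCCCC"}
index = build_kmer_index(node_seqs, k=3)
hits = seed_hits("CGTACTTCA", index, k=3)   # {1: 3, 3: 2}
clusters = cluster(hits)                    # [[1, 3]]
```

Each cluster identifies a candidate target subgraph against which the read would then be aligned.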

Page 10: Variation reference graphs and the variation graph toolkit vg

Alternative index: GCSA2

• Generalised Compressed Suffix Array
  – Jouni Sirén, Niko Välimäki, Veli Mäkinen

• Natural extension of BWT to graphs
  – Essentially set of minimal unique k-mers with one base prefix extension
  – Supports compression, FM-index style search etc.

• Now implemented for vg graph search
  – <20GB index, fast SMEM (Maximal Exact Match) seed and extend search

Jouni Sirén talk tomorrow
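For readers unfamiliar with "FM-index style search", here is a sketch of backward search on an ordinary linear string. It is the classic text-based technique that GCSA2 generalises to graphs, not GCSA2 itself, and it uses uncompressed tables and a naive suffix sort purely for clarity. All names are illustrative.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_fm_index(text):
    text += "$"                       # sentinel, sorts before A, C, G, T
    sa = suffix_array(text)
    bwt = "".join(text[i - 1] for i in sa)
    # first[c]: number of characters in text strictly smaller than c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i]
    occ = {c: [0] for c in first}
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return sa, first, occ


def count_and_locate(pattern, sa, first, occ):
    """Backward search: extend the match one base to the left per step."""
    lo, hi = 0, len(sa)               # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


fm = build_fm_index("ACGTACGA")
acg_positions = count_and_locate("ACG", *fm)   # [0, 4]
missing = count_and_locate("TT", *fm)          # []
```

The search extends the match by one base to the left at each step, which is the same direction as the "one base prefix extension" mentioned above.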

Page 11: Variation reference graphs and the variation graph toolkit vg

Pilot alignment and variant calling evaluation

Slides from Benedict Paten and collaborators

Page 14: Variation reference graphs and the variation graph toolkit vg

Genotyper output

The genotyper considers support for every bubble based on embedded paths. It emits genotypes as Locus records, each of which is a set of alleles represented as paths relative to the base graph.

Most variants are within the reference. Also consider new variants by (temporarily) augmenting the graph to include repeatedly seen alignment alternatives.
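The idea of weighing support for each allele of a bubble can be illustrated with the toy sketch below. It assumes alleles and read alignments are both given as lists of node ids, and it uses a simple read-fraction threshold to make a diploid call. That threshold rule is a stand-in chosen for illustration and is not the model vg uses; the function names and `min_fraction` are hypothetical.

```python
def contains_path(read_path, allele_path):
    """True if allele_path occurs as a contiguous run inside read_path."""
    n = len(allele_path)
    return any(read_path[i:i + n] == allele_path
               for i in range(len(read_path) - n + 1))


def allele_support(alleles, read_paths):
    """Count reads whose node path traverses each allele of one bubble."""
    return {
        name: sum(contains_path(read, path) for read in read_paths)
        for name, path in alleles.items()
    }


def call_genotype(support, min_fraction=0.2):
    """Toy diploid call: keep alleles carrying at least min_fraction of reads."""
    total = sum(support.values())
    if total == 0:
        return None                   # no informative reads at this bubble
    kept = sorted((a for a, n in support.items() if n / total >= min_fraction),
                  key=lambda a: -support[a])[:2]
    # One surviving allele means a homozygous call.
    return tuple(sorted(kept * 2 if len(kept) == 1 else kept))


# One bubble: nodes 2 and 3 are alternative routes between nodes 1 and 4.
alleles = {"ref": [1, 2, 4], "alt": [1, 3, 4]}
read_paths = [[0, 1, 2, 4], [1, 3, 4, 5], [1, 2, 4], [1, 3, 4], [4, 5]]

support = allele_support(alleles, read_paths)   # {"ref": 2, "alt": 2}
genotype = call_genotype(support)               # ("alt", "ref")
```

The read `[4, 5]` does not span the bubble, so it supports neither allele.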

Page 15: Variation reference graphs and the variation graph toolkit vg

Genotype evaluation: mix of CHM1/13 Illumina reads, truth from PacBio

[Figure panels: MHC, BRCA1]

Page 16: Variation reference graphs and the variation graph toolkit vg

An architecture supporting stable coordinates

[Figure: architecture diagram. Labels: Reference Graph; Aligned Reads; Augmented Graph & Alignments; Translation; Alignments, Paths, Genotypes, and Annotations Relative to the Augmented Graph]

Coordinates in vg are not stable across graph edits.

But, we can retain a mapping from new to old coordinates when editing.

This translation provides a stable coordinate system for VGs, solves surjection problem, and enables a virtuous feedback loop!
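Retaining a mapping from new to old coordinates during an edit can be sketched as follows. The example assumes the only edit is splitting a node, and that a translation is a plain dictionary from each new node id to an `(original node id, start offset)` pair. These structures and function names are hypothetical and much simpler than vg's own; edges are left out for brevity.

```python
def split_node(nodes, node_id, offset, next_id, translation):
    """Split node_id at offset; record where each new piece came from."""
    seq = nodes.pop(node_id)
    # The node being split may itself be the product of an earlier edit.
    old_id, old_start = translation.get(node_id, (node_id, 0))
    left, right = next_id, next_id + 1
    nodes[left], nodes[right] = seq[:offset], seq[offset:]
    translation[left] = (old_id, old_start)
    translation[right] = (old_id, old_start + offset)
    translation.pop(node_id, None)
    return left, right


def to_original(position, translation):
    """Map (node id, offset) in the edited graph to the original graph."""
    node_id, offset = position
    old_id, old_start = translation.get(node_id, (node_id, 0))
    return old_id, old_start + offset


original = {1: "ACGTACGT"}
nodes = dict(original)
translation = {}

split_node(nodes, 1, 3, next_id=2, translation=translation)   # 2="ACG", 3="TACGT"
split_node(nodes, 3, 2, next_id=4, translation=translation)   # 4="TA",  5="CGT"

old_position = to_original((5, 1), translation)               # (1, 6)
```

Because each edit composes with the translations already recorded, positions reported on the augmented graph can always be expressed against the original reference graph.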

Page 17: Variation reference graphs and the variation graph toolkit vg

Thank you Erik Garrison, Jouni Sirén, Eric Dawson, Jerven Bolleman, Adam Novak, Glen Hickey, Benedict Paten, Will Jones, Jordan Eizenga, Toshiaki Katayama, Orion Buske, Raoul Bonnal, Mike Lin, and many others who have helped us understand, design, implement and evaluate vg.