Variation reference graphs and
the variation graph toolkit vg
Erik Garrison, Jouni Sirén, Eric Dawson, Richard Durbin
Wellcome Trust Sanger Institute
Adam Novak, Benedict Paten et al., UCSC
and many others
Variation Reference
• Go beyond a linear reference
– Why a (quasi)-linear reference and a catalog of variants which we keep finding again?
• Local variation: graph reference
– Map to a structure including known variation
– >99% variants per person already seen
• Long range variation: haplotype structure
– Exploit variation sharing
– Support phasing
– Recombination rate ~ mutation rate
– >99% recombination breakpoints per person seen
Variation graphs: “Pan Genome”
A variation graph represents many genomes in one non-redundant structure.
Nodes contain sequence, and edges between the ends of nodes represent potential links between successive sequences.
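The node/edge/path idea can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not vg's data model: the class `VariationGraph` and its methods are hypothetical names, edges are forward-only, and the example graph encodes a single SNP.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy forward-only variation graph (hypothetical, not the vg API)."""
    nodes: dict = field(default_factory=dict)   # node id -> sequence
    edges: set = field(default_factory=set)     # (from id, to id)
    paths: dict = field(default_factory=dict)   # path name -> list of node ids

    def add_node(self, node_id, seq):
        self.nodes[node_id] = seq

    def add_edge(self, a, b):
        self.edges.add((a, b))

    def add_path(self, name, node_ids):
        # A path is only valid if every consecutive pair is linked by an edge.
        for a, b in zip(node_ids, node_ids[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge {a} -> {b}")
        self.paths[name] = list(node_ids)

    def path_sequence(self, name):
        # Spell out the genome embedded in the graph by this path.
        return "".join(self.nodes[n] for n in self.paths[name])


def build_snp_graph():
    """Two genomes differing by one base, stored without redundancy."""
    g = VariationGraph()
    for node_id, seq in [(1, "AGCT"), (2, "C"), (3, "T"), (4, "GGA")]:
        g.add_node(node_id, seq)
    for a, b in [(1, 2), (1, 3), (2, 4), (3, 4)]:
        g.add_edge(a, b)
    g.add_path("ref", [1, 2, 4])
    g.add_path("alt", [1, 3, 4])
    return g


graph = build_snp_graph()
stored_bases = sum(len(s) for s in graph.nodes.values())
```

Here the two 8-base genomes share nodes 1 and 4, so the graph stores 9 bases rather than 16.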
Variation graphs and train tracks
The links in a variation graph are bidirectional. They behave in many ways like train tracks. Nodes have positive and negative strands, allowing them to be traversed in either direction, and can be connected to form loops (repeats), inversions and translocations.
NB There are other ways to do this. One can have sequence on edges. Or unidirectional graphs (nearly) twice as big.
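As a rough sketch of what traversing a node on either strand means, the snippet below spells out a walk whose steps carry an orientation flag. The `(node_id, is_reverse)` encoding and the function names are assumptions made for illustration, not vg's representation.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def walk_sequence(nodes, walk):
    """Spell a walk given as a list of (node_id, is_reverse) steps."""
    return "".join(
        reverse_complement(nodes[n]) if rev else nodes[n]
        for n, rev in walk
    )


def flip_walk(walk):
    """The same walk seen from the other strand:
    reverse the step order and flip every orientation."""
    return [(n, not rev) for n, rev in reversed(walk)]


nodes = {1: "AAC", 2: "GGT", 3: "TTA"}
forward = [(1, False), (2, False), (3, False)]
# An inversion: node 2 is entered on its negative strand.
inverted = [(1, False), (2, True), (3, False)]
```

Flipping a walk yields the reverse complement of its sequence, which is why a single bidirected graph can stand in for both strands.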
“Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016), in press.
Essential operations on pan-genomes
github.com/vgteam/vg
Operations implemented in vg
Implementation in vg
• Nodes with sequence, Edges, Paths, Mappings
• Alignment tools and .gam format
• Serialisation to disk via protobuf, succinct representation xg, graph building/editing, extraction, unrolling and DAGification of local graphs etc.
https://github.com/vgteam/vg
[Figure: k-mer based alignment. Sequence shown: AGCTCTCCTTGTCCCTCCTACGATCTCTTCACTGGCCTCTTATCTTTACTGTTACCAAATCTTTCCGGAAGCTGCTCTTTC. Labels: read; k-mers; find k-mer subgraphs; node ids; hit clusters; cluster ids; target subgraph; partial order alignment]
Alignment
k-mer based alignment of short reads to a variation graph
store results in Graph Alignment Map (GAM) format
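The seeding stage of such an aligner can be sketched as follows. This is a toy version under stated assumptions, not vg's implementation: k-mers are indexed only along explicitly listed walks, clustering simply groups nearby node ids (which presumes ids are locally ordered), and the final partial order alignment step is omitted. All function names are hypothetical.

```python
from collections import defaultdict


def index_kmers(nodes, walks, k):
    """Map each k-mer seen along the given walks to the node ids it overlaps."""
    index = defaultdict(set)
    for walk in walks:
        seq = "".join(nodes[n] for n in walk)
        # owner[i] is the node id that supplies base i of the walk sequence.
        owner = [n for n in walk for _ in nodes[n]]
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].update(owner[i:i + k])
    return index


def seed_hits(index, read, k):
    """Count, per node id, how many k-mers of the read touch that node."""
    hits = defaultdict(int)
    for i in range(len(read) - k + 1):
        for node_id in index.get(read[i:i + k], ()):
            hits[node_id] += 1
    return dict(hits)


def cluster_node_ids(node_ids, max_gap=2):
    """Group sorted node ids whose successive difference is at most max_gap."""
    clusters = []
    for n in sorted(node_ids):
        if clusters and n - clusters[-1][-1] <= max_gap:
            clusters[-1].append(n)
        else:
            clusters.append([n])
    return clusters


nodes = {1: "AGCT", 2: "C", 3: "T", 4: "GGA", 10: "TTTTT"}
walks = [[1, 2, 4], [1, 3, 4], [10]]
index = index_kmers(nodes, walks, k=3)
hits = seed_hits(index, "CTTGG", k=3)
clusters = cluster_node_ids(hits)
```

Each cluster names a candidate target subgraph; a real aligner would then extract that subgraph and align the full read against it.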
Alternative index: GCSA2
• Generalised Compressed Suffix Array
– Jouni Sirén, Niko Välimäki, Veli Mäkinen
• Natural extension of BWT to graphs
– Essentially set of minimal unique k-mers with one base prefix extension
– Supports compression, FM-index style search etc.
• Now implemented for vg graph search
– <20GB index, fast SMEM seed and extend search
(Maximal Exact Match)
Jouni Sirén talk tomorrow
Pilot alignment and variant calling evaluation
Slides from Benedict Paten and collaborators
Genotyper output
The genotyper considers support for every bubble based on embedded paths and emits genotypes as Locus records that are each a set of alleles represented as paths relative to the base graph.
Most variants are within the reference. Also consider new variants by (temporarily) augmenting the graph to include repeatedly seen alignment alternatives.
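A toy version of counting support for the alleles of one bubble might look like the sketch below. It is not vg's genotyper: the diploid assumption, the thresholds `min_reads` and `min_fraction`, and all function names are hypothetical, and read alignments are reduced to plain lists of node ids.

```python
def contains_subpath(walk, allele):
    """True if the allele's node ids occur contiguously in the read's walk."""
    n = len(allele)
    return any(
        tuple(walk[i:i + n]) == tuple(allele)
        for i in range(len(walk) - n + 1)
    )


def allele_support(read_walks, alleles):
    """Number of reads that fully traverse each allele path."""
    return {
        name: sum(contains_subpath(w, path) for w in read_walks)
        for name, path in alleles.items()
    }


def call_genotype(support, min_reads=2, min_fraction=0.2):
    """Pick up to two supported alleles (diploid assumption).
    Returns a sorted pair of allele names, or None if nothing passes."""
    total = sum(support.values())
    if total == 0:
        return None
    passing = [
        name for name, n in support.items()
        if n >= min_reads and n / total >= min_fraction
    ]
    kept = sorted(passing, key=lambda name: -support[name])[:2]
    if not kept:
        return None
    if len(kept) == 1:
        return (kept[0], kept[0])  # homozygous call
    return tuple(sorted(kept))


alleles = {"ref": (1, 2, 4), "alt": (1, 3, 4)}
read_walks = [[1, 2, 4], [1, 2, 4], [1, 3, 4], [1, 3, 4], [1, 3, 4], [4]]
support = allele_support(read_walks, alleles)
genotype = call_genotype(support)
```

The read that only touches node 4 supports neither allele, since it does not span the bubble.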
Genotype evaluation: mix of CHM1/13 Illumina reads, truth from PacBio
[Figure panels: MHC, BRCA1]
[Figure labels: Reference Graph; Aligned Reads; Augmented Graph & Alignments; Translation; Alignments, Paths, Genotypes, and Annotations Relative to the Augmented Graph]
Coordinates in vg are not stable across graph edits.
But, we can retain a mapping from new to old coordinates when editing.
This translation provides a stable coordinate system for VGs, solves surjection problem, and enables a virtuous feedback loop!
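The idea of keeping a new-to-old mapping during an edit can be sketched with a single node split. This is an illustration under assumptions, not vg's Translation format: the functions `split_node` and `to_old_position` are hypothetical, and only nodes derived from the old graph receive an entry (novel sequence would have none).

```python
def split_node(nodes, node_id, offset, new_ids):
    """Split one node at offset into two new nodes.
    Returns (new_nodes, translation), where translation maps each new
    node id to (old node id, offset of its first base in the old node)."""
    seq = nodes[node_id]
    if not 0 < offset < len(seq):
        raise ValueError("offset must fall strictly inside the node")
    left_id, right_id = new_ids
    new_nodes = {n: s for n, s in nodes.items() if n != node_id}
    new_nodes[left_id] = seq[:offset]
    new_nodes[right_id] = seq[offset:]
    # Untouched nodes translate to themselves.
    translation = {n: (n, 0) for n in nodes if n != node_id}
    translation[left_id] = (node_id, 0)
    translation[right_id] = (node_id, offset)
    return new_nodes, translation


def to_old_position(translation, new_node, new_offset):
    """Project a position in the edited graph back onto the original graph."""
    old_node, base = translation[new_node]
    return old_node, base + new_offset


old_nodes = {1: "AGCTCGGA", 5: "TT"}
new_nodes, translation = split_node(old_nodes, 1, 4, new_ids=(2, 3))
old_position = to_old_position(translation, 3, 1)
```

Anything annotated against the edited graph can then be reported in the original coordinates by looking up its node in the translation.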
An architecture supporting stable coordinates
Thank you Erik Garrison, Jouni Sirén, Eric Dawson, Jerven Bolleman, Adam Novak, Glen Hickey, Benedict Paten, Will Jones, Jordan Eizenga, Toshiaki Katayama, Orion Buske, Raoul Bonnal, Mike Lin, and many others who have helped us understand, design, implement and evaluate vg.