Snippy - Rapid bacterial variant calling - UK - Tue 5 May 2015

Snippy

Torsten Seemann

Balti & Bioinformatics - Birmingham, UK - Tue 5 May 2015

Rapid bacterial variant calling & core genome alignments

Background

(Far) south east England

Phyloflagomics

UK / Birmingham Australia / Victoria Canada / British Columbia

A new home

Centre for Applied Microbial Genomics

Microbiological Diagnostic Unit

∷ Oldest public health lab in Australia: established 1897 in Melbourne: large historical isolate collection back to 1950s

∷ National reference laboratory: Salmonella, Listeria, EHEC

∷ WHO regional reference lab: vaccine preventable invasive bacterial pathogens

New director

∷ Professor Ben Howden: clinician, microbiologist, pathologist: early adopter of genomics and bioinformatics: long term collaborator on MRSA/VRE w/ Tim Stinear

∷ Mandate: modernise service delivery: enhance research output and collaboration: nationally lead the conversion to WGS

Hardware

∷ Sequencers: NextSeq 500: 3 x MiSeq: PacBio RS II (arriving 22 May)

∷ Robots: Perkin Elmer (does not have a Twitter account): Colony picker

∷ Compute: 240 TB, 10 GigE, 3 x 72 core boxes

Variant calling

∷ Find DNA differences between genomes: variants to explain phenotype: validate your complemented mutant

∷ Two approaches: reference based (read alignment): reference-free (de novo assembly / k-mer based)

Types of variants

∷ Substitutions: single nucleotide polymorphism (snp) A➝C: multiple nucleotide polymorphism (mnp) AG➝TC

∷ Indels: insertion (ins) A➝AC : deletion (del) ACCG➝AG

∷ Complex: compound events AC➝T
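The variant types above can be told apart from the REF and ALT strings alone. A minimal sketch, assuming two rules that the slides do not spell out: an mnp changes every base of an equal-length pair, and an indel is "pure" only when the shorter allele is the longer one with a single contiguous run removed. The function names are hypothetical and this is not necessarily how Snippy or freebayes label variants.

```python
def shared_prefix(a: str, b: str) -> int:
    """Length of the common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def classify_variant(ref: str, alt: str) -> str:
    """Label a REF -> ALT change as snp, mnp, ins, del or complex (assumed rules)."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    if len(ref) == len(alt):
        if len(ref) == 1:
            return "snp"
        # Assumption: an mnp changes every base; a partial change is complex.
        if all(r != a for r, a in zip(ref, alt)):
            return "mnp"
        return "complex"
    shorter, longer = sorted((ref, alt), key=len)
    prefix = shared_prefix(shorter, longer)
    # Compare what is left after the prefix, read from the right-hand end.
    suffix = shared_prefix(shorter[prefix:][::-1], longer[prefix:][::-1])
    # Pure indel: prefix and suffix together account for the whole shorter allele.
    if prefix + suffix == len(shorter):
        return "ins" if len(alt) > len(ref) else "del"
    return "complex"


for ref, alt in [("A", "C"), ("AG", "TC"), ("A", "AC"), ("ACCG", "AG"), ("AC", "T")]:
    print(f"{ref} -> {alt}: {classify_variant(ref, alt)}")
```

Run on the five examples from the slide, this gives snp, mnp, ins, del and complex in that order.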

My solution

Snippy

∷ Fast → snappy

∷ Finds variants → SNPs

∷ Australian → Skippy the bush kangaroo

∷ FASTQ files: paired end, interleaved, or single-end

∷ Reference: FASTA or Genbank

∷ Output folder: self contained bundle of results

Inside the black box

∷ bwa mem - no clipping needed

∷ samtools - sorted, filtered BAM

∷ freebayes - split / GNU parallel / merge

∷ vcflib/vcftools - VCF filtering

∷ perl - glue
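The freebayes step is listed as split / GNU parallel / merge. A minimal sketch of the split part only, assuming the reference is cut into fixed-size chunks per contig and that each chunk is named `contig:start-end` with 0-based, end-exclusive coordinates. The function name, the chunk size and the coordinate convention are assumptions for illustration; how Snippy actually chooses regions is not stated here, and the parallel run and merge are not shown.

```python
def split_regions(contig_lengths: dict[str, int], chunk: int) -> list[str]:
    """Cut every contig into consecutive chunks of at most `chunk` bases."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, chunk):
            end = min(start + chunk, length)  # last chunk may be shorter
            regions.append(f"{name}:{start}-{end}")
    return regions


# Hypothetical contig lengths, chosen only to keep the output short.
print(split_regions({"chr": 250, "plas": 80}, chunk=100))
```

Each region string could then be handed to one variant-calling job, with the per-region results concatenated in order afterwards.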

Outputs

∷ Read alignments: .bam / .bai

∷ Variants: .vcf / .vcf.gz / .vcf.gz.tbi / .gff .bed .tab .csv .html

∷ Consensus: reference with all variants applied to it

∷ Genome alignment: reference with “-” (missing) and “N” low depth
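The consensus output is described as the reference with all variants applied to it. A minimal sketch of that idea, assuming variants are given as (1-based position, REF, ALT) tuples that do not overlap. The function name and the toy sequence are hypothetical, and this is not Snippy's own implementation.

```python
def apply_variants(reference: str, variants: list[tuple[int, str, str]]) -> str:
    """Return the reference with each (pos, ref, alt) substitution applied."""
    seq = reference
    # Work from right to left so indels do not shift positions still to be used.
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match reference at {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


# Toy example: one snp at position 2 and one deletion at position 4.
print(apply_variants("GATTACA", [(2, "A", "G"), (4, "TA", "T")]))  # GGTTCA
```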

TAB output

CHROM  POS     TYPE     REF   ALT   EVIDENCE        FTYPE  STRAND  NT_POS  AA_POS  LOCUS_TAG  GENE  PRODUCT
chr    5958    snp      A     G     G:44 A:0        CDS    +       41/600  13/200  ECO_0001   dnaA  replication protein
chr    35524   snp      G     T     T:73 G:1 C:1    tRNA   -
chr    45722   ins      ATT   ATTT  ATTT:43 ATT:1   CDS    -                       ECO_0045   gyrA  DNA gyrase
chr    100541  del      CAAA  CAA   CAA:38 CAAA:1   CDS    +                       ECO_0179         hypothetical protein
plas   619     complex  GATC  AATA  GATC:28 AATA:0
plas   3221    mnp      GA    CT    CT:39 CT:0      CDS    +                       ECO_p012   rep   hypothetical protein
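The TAB output is easy to consume from a script. A minimal sketch, assuming the file is tab-delimited with the header row shown above and that EVIDENCE holds space-separated `allele:count` items. Only the first six columns are used, and the helper names are hypothetical.

```python
import csv
import io


def parse_evidence(field: str) -> dict[str, int]:
    """Turn 'T:73 G:1 C:1' into {'T': 73, 'G': 1, 'C': 1}."""
    counts = {}
    for item in field.split():
        allele, count = item.rsplit(":", 1)
        counts[allele] = int(count)
    return counts


def read_variants(handle) -> list[dict]:
    """Read tab-delimited rows, converting POS and EVIDENCE to Python types."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["POS"] = int(row["POS"])
        row["EVIDENCE"] = parse_evidence(row["EVIDENCE"])
        rows.append(row)
    return rows


# Two rows copied from the example table, re-typed with tab separators.
SAMPLE = (
    "CHROM\tPOS\tTYPE\tREF\tALT\tEVIDENCE\n"
    "chr\t5958\tsnp\tA\tG\tG:44 A:0\n"
    "chr\t35524\tsnp\tG\tT\tT:73 G:1 C:1\n"
)

for v in read_variants(io.StringIO(SAMPLE)):
    print(v["CHROM"], v["POS"], v["TYPE"], v["EVIDENCE"])
```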

Phylogenomics

Phylogenetics 101

∷ Choose some genes

∷ Sequence each gene from each isolate

∷ Align the protein sequences of each gene

∷ Back-align to nucleotide space

∷ Concatenate all the alignments

∷ Construct a distance matrix (many ways)

∷ Draw a tree (many ways)

∷ Make wild inferences from little data

Phylogenomics 101

∷ Assemble each genome

∷ Perform whole genome alignment: in nucleotide space, as we don’t know what is coding: very computationally expensive: can’t parallelize as with individual genes

∷ Continue as for phylogenetics

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC

∷ Ideally, feed this directly to a tree builder

∷ Properly model gaps, codons and ambiguity

∷ Hard!

Whole genome alignment

Core genome SNPs

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||

Core sites are present in all genomes.

Core genome

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |

Core SNPs = polymorphic sites in core genome

Core SNPs

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |
SNPs’        |   |  |          |

Unambiguous core SNPs

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
SNPs’        |   |  |          |

site     1    2    3    4
alleles  ata  ttc  ata  atg

(each allele string reads bug1, bug2, bug3)

Allele sites

>bug1
ATAA
>bug2
TTTT
>bug3
ACAG
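The steps from whole genome alignment to allele sites can be sketched in a few lines. This is a minimal illustration, assuming "core" means no "-" in the column, "SNP" means more than one distinct symbol in a core column, and "unambiguous" means every symbol is one of A, C, G, T. The function names are hypothetical and this is not Snippy's code.

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_sites(aln: dict[str, str]) -> list[int]:
    """1-based positions where every genome has a base (no gap)."""
    return [i for i, col in enumerate(zip(*aln.values()), 1) if "-" not in col]


def core_snps(aln: dict[str, str], unambiguous: bool = False) -> list[int]:
    """1-based core positions that are polymorphic."""
    cols = list(zip(*aln.values()))
    sites = []
    for i in core_sites(aln):
        col = cols[i - 1]
        if unambiguous and not set(col) <= set("ACGT"):
            continue  # drop columns containing N, R, or other ambiguity codes
        if len(set(col)) > 1:
            sites.append(i)
    return sites


def snp_alignment(aln: dict[str, str], sites: list[int]) -> dict[str, str]:
    """Keep only the chosen columns from each sequence."""
    return {name: "".join(seq[i - 1] for i in sites) for name, seq in aln.items()}


print(core_snps(ALIGNMENT))                    # [8, 12, 15, 23, 26]
sites = core_snps(ALIGNMENT, unambiguous=True)
print(sites)                                   # [8, 12, 15, 26]
for name, seq in snp_alignment(ALIGNMENT, sites).items():
    print(f">{name}\n{seq}")
```

The final loop prints the same three four-base sequences as the FASTA records above.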

Alignment ⇢Tree

   +------ bug3
   |
---+--- bug1
   |
   +--------- bug2

--- 1 SNP
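One of the "many ways" to get from an alignment to a distance matrix is to count differing sites for every pair. A minimal sketch using the three-isolate SNP alignment from the slides. The plain mismatch count is an assumption chosen for simplicity, not necessarily the method behind the tree drawn here, and the function names are hypothetical.

```python
from itertools import combinations

SNP_ALIGNMENT = {"bug1": "ATAA", "bug2": "TTTT", "bug3": "ACAG"}


def snp_distance(a: str, b: str) -> int:
    """Number of aligned positions where the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(aln: dict[str, str]) -> dict[tuple[str, str], int]:
    """Pairwise SNP distances, one entry per unordered pair of isolates."""
    return {(p, q): snp_distance(aln[p], aln[q]) for p, q in combinations(aln, 2)}


for (p, q), d in distance_matrix(SNP_ALIGNMENT).items():
    print(p, q, d)
```

For this alignment the distances are 3 (bug1 vs bug2), 2 (bug1 vs bug3) and 4 (bug2 vs bug3).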

The N±1 problem

Aligning to reference

∷ Why is whole genome alignment not used?: involves genome (mis)assembly: computationally difficult: expensive to add or remove isolates

∷ Short-cut: choose a single reference: align each isolate’s reads to the reference: core, by definition, must include the reference

Read mapping considerations

∷ Choice of reference

∷ Too divergent?: reads may not align well: will get too many core genome SNPs

∷ One solution: Assemble one isolate and use as the reference

SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core1 |||   ||||||||||| ||||||||||
SNPs1        |      |      ||  |

Remove taxon, different core (1)

SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core2 | |   |||||||||       ||||||
SNPs2        |   |  |       |  |

SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core3 | |||||||||||||       ||||||
SNPs3            |             |

Core genome alignments

∷ Core SNP alignments: can shift dramatically with taxa content: we are only using globally conserved sites: remember variation still exists outside “core”

∷ Snippy will keep the full alignments: quickly derive subsets on the fly: adding isolates can be done quickly too
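Keeping the full alignment makes it cheap to recompute the core for any subset of isolates. A minimal sketch using the three-isolate example, assuming core means no "-" in a column and a SNP is any core column with more than one distinct symbol (so N and R count as differences here, as in the figures). The function name is hypothetical.

```python
from itertools import combinations

FULL_ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_and_snps(aln: dict[str, str], names: tuple[str, ...]) -> tuple[int, int]:
    """Count core sites and core SNP sites for the chosen isolates only."""
    core = snps = 0
    for col in zip(*(aln[n] for n in names)):
        if "-" in col:
            continue  # a gap in any chosen isolate removes the site from the core
        core += 1
        if len(set(col)) > 1:
            snps += 1
    return core, snps


subsets = [tuple(FULL_ALIGNMENT)] + list(combinations(FULL_ALIGNMENT, 2))
for names in subsets:
    core, snps = core_and_snps(FULL_ALIGNMENT, names)
    print(f"{'+'.join(names)}: core={core} SNPs={snps}")
```

With all three isolates the core is 17 sites with 5 SNPs. Dropping bug3 grows the core to 24 sites, while dropping bug2 gives 20 core sites but only 2 SNPs, which is the N±1 effect the slides describe.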

Conclusion

Snippy summary

∷ The good: Fast, scales to 100 cores: Simple, clean interface and output

∷ The bad: Doesn’t yet report full variant consequences using snpEff

∷ The ugly?: Written in Perl

Contact

∷ tseemann.github.io

∷ github.com/tseemann/snippy

∷ @torstenseemann
