Mutation detection using whole genome sequencing

2017 Winter School in Mathematical and Computational Biology

Ann-Marie Patch

3rd July 2017

Mutation detection success depends on previous steps

Sample preparation
Library preparation
Sequence generation
Initial data processing
Analysis

•DNA
•RNA
•miRNA
•BAC
•Fragmentation
•Size selection
•Target enrichment
•Indexing
•Platform
•Sequence length
•Base calling
•Quality assessment
•De novo assembly
•Alignment
•Mutation detection
•Annotation
•Biological interpretation

Generalised sequencing workflow

Whole genome paired-end sequencing process recap - library preparation

Genomic DNA

Sample preparation is key to getting good results

High molecular weight DNA is required (not just DNA quantification)

10 000bp

Smears indicate degraded samples

Genomic DNA

Fragment DNA

Clean-up DNA fragments

Consistent fragment size distribution across all your samples

Adaptors added

Sequence reads produced from both ends of each fragment

The distance between the outer ends of a read pair should follow the DNA fragment size distribution

Clean-up DNA fragments

HiSeq ~300 bp

HiSeq X Ten ~400 bp

FFPE (formalin-fixed paraffin-embedded) DNA samples have a high degree of fragmentation.

This produces a shorter TLEN, so think about the read length of the sequencing or you could end up paying to sequence the adapters

Fragment length median 150bp Adapter 1 Adapter 2

HiSeq X Ten sequencing 2x 150bp
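The read-through arithmetic behind this warning can be sketched as follows. This is a minimal illustration, assuming every read is sequenced to a fixed length from each end of the fragment; the function names are hypothetical and do not come from any sequencing toolkit.

```python
def adapter_bases_per_read(fragment_length: int, read_length: int) -> int:
    """Bases at the end of one read that run past the fragment into the adapter."""
    return max(0, read_length - fragment_length)


def mate_overlap(fragment_length: int, read_length: int) -> int:
    """Fragment bases covered by both reads of a pair, i.e. sequenced twice."""
    covered = min(fragment_length, read_length)  # fragment bases seen by each read
    return max(0, 2 * covered - fragment_length)


if __name__ == "__main__":
    # Illustrative fragment lengths sequenced as 2 x 150 bp
    for fragment in (100, 150, 300, 400):
        print(fragment,
              adapter_bases_per_read(fragment, 150),
              mate_overlap(fragment, 150))
```

With a 150 bp median fragment and 2x 150 bp reads, about half of the fragments are shorter than the read length and so contribute adapter sequence, and both reads of a 150 bp fragment cover the same bases.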

Library production is key for successful mutation detection

Paired-end sequence alignment to a reference genome

[Figure labels: Reference genome; Paired-end sequences mapped to genome; Read depth]

Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine variations and structural rearrangements

Detection software pinpoints differences in your sample compared to a reference

[Figure labels: SNV/indels; deletions; amplifications; translocations]

Variants are recorded as positional information and read counts generated by detection software

Software tools are usually designed to detect one type of variant

Choosing mutation calling software

Choice can be guided by:
•Type of data
•The biological question
•Available computing resources
•Past experience
•Related literature

QIMR DNA variant detection

•Substitutions
  •qSNP – QIMR
  •GATK – Unified genotyper – Broad
•Small insertions and deletions
  •GATK – Unified genotyper – Broad
•Large structural variations
  •qSV – QIMR
•Copy number aberrations
  •ASCAT-ngs

Identifying software to try

One of many on-line resources

https://omictools.com/whole-genome-resequencing-category

Just remember:

Each piece of software will give different results

Results depend on the quality of the starting material
•DNA/RNA quality
•disease or organism type

Evaluate the output from each tool for your data and research question

Visualising the variants detected

Robinson et al 2011

XTen WGS 150bp paired-end data

Below is a 404bp region from human chromosome 3

Reference sequence

A pair of reads

Grey blocks: base pairs matching the reference

Small coloured blocks indicate a change from the reference

Identifying signal from noise

Robinson et al 2011

Below is a 40bp region over exon 14 of BAP1

Reference sequence

Signal – consistent variants per position across a number of reads

Noise – random variants per position in only a few reads

Total read count = 37
No. of reference (T) = 19
No. of alternate (C) = 18
51% T : 49% C

In a diploid organism this equates to a heterozygous variant
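The read counts above can be turned into allele fractions with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the binomial check assumes that reads sample the two alleles of a heterozygous site independently with a probability of 0.5 each, which ignores mapping bias and sequencing error.

```python
from math import comb


def allele_fractions(ref_count: int, alt_count: int) -> tuple:
    """Return (reference fraction, alternate fraction) at one position."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no reads cover this position")
    return ref_count / total, alt_count / total


def prob_alt_at_most(alt_count: int, depth: int, expected_fraction: float = 0.5) -> float:
    """P(X <= alt_count) for X ~ Binomial(depth, expected_fraction)."""
    p = expected_fraction
    return sum(comb(depth, k) * p**k * (1 - p) ** (depth - k)
               for k in range(alt_count + 1))


ref_frac, alt_frac = allele_fractions(19, 18)
print(f"{ref_frac:.0%} T : {alt_frac:.0%} C")  # counts taken from the slide
print(prob_alt_at_most(18, 37))                # how unusual is 18 or fewer of 37?
```

For 18 alternate reads out of 37 the cumulative probability is about 0.5, so the observation is entirely consistent with a heterozygous site under this simple model.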

Ensure coverage is not a limiting factor

In germline sequencing most homozygous SNVs are detected at a 15X average depth, but an average depth of 33X is required to reproducibly detect the same proportion of heterozygous SNVs

Bentley et al. Nature (2008)

Uninformative reads impact on the final coverage and ultimately the mutation detection sensitivity

Sims et al. Nat Rev Genet. 2014
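Why heterozygous variants need more depth can be illustrated with a toy model. The sketch below assumes that per-site depth is Poisson distributed around the average depth, that each read at a heterozygous site carries the alternate allele with probability 0.5, and that a caller needs at least `min_alt_reads` supporting reads. All three are simplifying assumptions made for illustration (the threshold of 3 is hypothetical) and this is not the analysis used in the cited papers.

```python
from math import comb, exp, lgamma, log


def poisson_pmf(depth: int, mean_depth: float) -> float:
    """Probability of observing exactly `depth` reads at a site."""
    return exp(depth * log(mean_depth) - mean_depth - lgamma(depth + 1))


def prob_at_least(min_alt_reads: int, depth: int, alt_fraction: float = 0.5) -> float:
    """P(number of alternate reads >= min_alt_reads) at a given depth."""
    below = sum(comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
                for k in range(min(min_alt_reads, depth + 1)))
    return 1.0 - below


def het_detection_probability(mean_depth: float, min_alt_reads: int = 3) -> float:
    """Chance that a heterozygous site gets enough alternate reads."""
    max_depth = int(mean_depth * 4) + 10  # the Poisson tail beyond this is negligible
    return sum(poisson_pmf(d, mean_depth) * prob_at_least(min_alt_reads, d)
               for d in range(max_depth + 1))


for depth in (15, 33):
    print(depth, het_detection_probability(depth))
```

Under these assumptions the detection probability rises with average depth; real requirements are higher because callers also have to separate true alternate reads from sequencing and mapping errors.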

What about when the noise level is high?

Is 22% of the alternate allele enough to make a good call?

Same XTen data as before but a different region

Understand the characteristics of the sequencing reads

Detection tools read in the alignment files (BAM files)

Samtools can help you investigate the reads that provide evidence of the variant

SAM fields

 1   QNAME  Query template/pair NAME
 2   FLAG   bitwise FLAG
 3   RNAME  Reference sequence NAME
 4   POS    1-based leftmost POSition/coordinate of clipped sequence
 5   MAPQ   MAPping Quality (Phred-scaled)
 6   CIGAR  extended CIGAR string
 7   MRNM   Mate Reference sequence NaMe (`=` if same as RNAME)
 8   MPOS   1-based Mate POSition
 9   TLEN   inferred Template LENgth (insert size)
10   SEQ    query SEQuence on the same strand as the reference
11   QUAL   query QUALity (ASCII-33 gives the Phred base quality)
12+  OPT    variable OPTional fields in the format TAG:VTYPE:VALUE
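The fields above can be read directly from a tab-separated SAM line. The following is a minimal sketch in plain Python rather than a replacement for Samtools; the example record is invented for illustration, and the helper names are hypothetical. The FLAG bit meanings follow the SAM specification.

```python
FIELD_NAMES = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "TLEN"}

FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line: str) -> dict:
    """Split one alignment line into the 11 mandatory fields plus OPT."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(FIELD_NAMES):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = {name: (int(value) if name in INT_FIELDS else value)
              for name, value in zip(FIELD_NAMES, columns)}
    record["OPT"] = columns[len(FIELD_NAMES):]  # TAG:VTYPE:VALUE strings
    return record


def decode_flag(flag: int) -> list:
    """Names of the bits that are set in the bitwise FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def phred_qualities(qual: str) -> list:
    """ASCII code minus 33 gives the Phred base quality."""
    return [ord(char) - 33 for char in qual]


# Invented example record, not taken from real data
EXAMPLE = "read1\t99\tchr3\t1000\t60\t10M\t=\t1200\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(EXAMPLE)
print(decode_flag(record["FLAG"]), phred_qualities(record["QUAL"])[:3])
```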

Mononucleotide runs are particularly error prone

Inconsistent small deletions of 2 to 7 bp at a homopolymer run of A’s

Polymerase slippage during sequencing?
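One way to flag calls that sit in such regions is to annotate mononucleotide runs in the reference sequence. This is a minimal sketch; the function name and the minimum run length of 6 are hypothetical choices for illustration, not thresholds from the talk.

```python
from itertools import groupby


def find_homopolymers(sequence: str, min_length: int = 6) -> list:
    """Return (0-based start, base, run length) for each mononucleotide run."""
    runs = []
    position = 0
    for base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


# Invented sequence containing a run of eight A's
print(find_homopolymers("ACG" + "A" * 8 + "TGC"))
```

Indel calls that overlap a reported run can then be reviewed more carefully or filtered.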

Sources of bias in sequencing data

Changes in expected proportions can be due to:

• Sample purity/integrity and heterogeneity

• Stochastic sampling/low coverage depth

• Capture or enrichment bias

• Alignment/mapping bias

• Repetitive regions of the reference sequence

• Sequencing error / Platform related artefact

How should we determine a good call from error?

Detection tools often attempt to provide a confidence level for the variant call

qSNP: an in-house, sensitive, rules-based heuristic tool (Kassahn et al 2013)

GATK (Unified genotyper): a Bayesian tool (McKenna et al 2010)

        Germline     Filtered germline
qSNP    4,180,630    3,698,034
GATK    4,945,990    4,069,314

A simplified view states humans are 99.9% identical

We therefore expect ~3,000,000 single nucleotide variants per person (1000 per Mb or 0.1%)

Filtering of results from mutation detection tools is necessary
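The expectation and the effect of filtering can be checked with simple arithmetic. The sketch below assumes a genome size of roughly 3,000 Mb, which is an approximation introduced here; the call counts are the ones reported in the table above.

```python
GENOME_SIZE_MB = 3_000   # assumption: roughly 3 Gb human genome
VARIANTS_PER_MB = 1_000  # 0.1% of positions differ

expected = GENOME_SIZE_MB * VARIANTS_PER_MB

# (germline calls, filtered germline calls) from the table above
calls = {"qSNP": (4_180_630, 3_698_034), "GATK": (4_945_990, 4_069_314)}

summary = {}
for tool, (raw, filtered) in calls.items():
    summary[tool] = {
        "removed": raw - filtered,                    # calls dropped by filtering
        "percent_removed": 100 * (raw - filtered) / raw,
        "excess_over_expected": filtered - expected,  # still above the rough estimate
    }

print(expected)
for tool, stats in summary.items():
    print(tool, stats)
```

Both tools report more calls than the rough expectation even after filtering, which is why the filtering strategy itself deserves scrutiny.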

Strategy for identifying and filtering substitution variants

Quality filter the reads that are used by the detection software
•Remove duplicate reads
•Require a minimum mapping quality for reads e.g. >10
•Impose a maximum number of mismatches allowed in a read e.g. <=3
•Require a minimum number of consecutive matched bases in a read >=34

Understand the characteristics that influence the confidence of a variant call
•What is the minimum number/proportion of variant-containing reads required?
•Is there a minimum read depth for a good call?
•Are the base qualities for variant positions taken into account?

Look for potential weakness in calls by adding your own annotation
•Position of the variant within the reads: are they all at the ends of reads?
•Is the variant identified in reads sequenced in both directions?
•Is the variant identified in the majority of your samples, so could be an artefact?
•Is the variant in a repetitive area of the reference?
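The read-level filters listed above can be sketched as a single predicate. This is an illustration under stated assumptions, not the implementation used by qSNP or GATK: the `Read` class is hypothetical, the NM tag (edit distance, which also counts indel bases) stands in for the mismatch count, and the longest run of matches is taken from the MD tag, which ignores insertions.

```python
import re
from dataclasses import dataclass

DUPLICATE_FLAG = 0x400  # SAM FLAG bit for PCR or optical duplicates


@dataclass
class Read:
    flag: int  # bitwise FLAG
    mapq: int  # mapping quality
    nm: int    # NM tag: edit distance to the reference
    md: str    # MD tag: string describing mismatching positions


def longest_match_run(md: str) -> int:
    """Longest stretch of consecutive matching bases recorded in the MD tag."""
    return max((int(n) for n in re.findall(r"\d+", md)), default=0)


def passes_read_filters(read: Read, mapq_above: int = 10,
                        max_mismatches: int = 3, min_match_run: int = 34) -> bool:
    """Apply the four example thresholds from the list above."""
    if read.flag & DUPLICATE_FLAG:
        return False  # remove duplicate reads
    if read.mapq <= mapq_above:
        return False  # mapping quality must be > 10
    if read.nm > max_mismatches:
        return False  # at most 3 mismatches
    return longest_match_run(read.md) >= min_match_run  # >= 34 consecutive matches


print(passes_read_filters(Read(flag=99, mapq=60, nm=1, md="50A49")))
```

Keeping the thresholds as parameters makes it easy to test how sensitive the final call set is to each one.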

Download an appropriate published dataset to compare your output with what was published.

For human germline sequencing there are standard datasets

• Genome in a bottle (National Institute of Standards and Technology)

• Platinum genomes from CEPH family (Illumina)

• For cancer COLO-829 is often used

Verification

Use a different technology or source material to test a selection of your variant calls

Benchmark your processes and verify your findings

Detect mutations

Examine

Manual IGV review

Identify patterns and modify filtering strategies

•PCR and capillary sequencing

•PCR and deep MiSeq sequencing

•SOLiD sequencing

•mRNA sequencing

Mutation cataloguing aids understanding of cancer genetics

Our Projects

Ovarian

Pancreatic

Melanoma

International Cancer Genome Consortium projects

ICGC patient summary of mutations identified

Circos, Krzywinski et al 2009

Chromosomes

SNP array track that shows copy number gain in red and loss in green and regions of loss of heterozygosity

Structural variants in centre

Coding small mutations with amino acid change

ICGC data portal: https://dcc.icgc.org/

Cancer sample sequencing involves the parallel analysis of at least two samples for each patient

Inherited genome sample: Germline

Tumour genome sample: Tumour
This sample contains a mixed population of normal and tumour cells
Also subpopulations of different tumour cells

Analysis
•Mutation detection
•Annotation
•Biological interpretation

Germline variants: seen in both samples

Somatic mutations: specific to tumour sample
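The germline/somatic distinction can be expressed as a set comparison between the two call sets. This is a deliberately naive sketch: the function name and the example coordinates are hypothetical, and real somatic callers also check that the normal sample has enough coverage at the site before calling a variant tumour-specific.

```python
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def classify_variants(normal_calls: Iterable[Variant],
                      tumour_calls: Iterable[Variant]) -> Dict[str, Set[Variant]]:
    """Split calls into germline (both samples) and somatic (tumour only)."""
    normal = set(normal_calls)
    tumour = set(tumour_calls)
    return {
        "germline": normal & tumour,     # seen in both samples
        "somatic": tumour - normal,      # specific to the tumour sample
        "normal_only": normal - tumour,  # worth reviewing, e.g. lost in the tumour
    }


# Invented example calls, not real variants
normal = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
tumour = [("chr1", 1000, "A", "G"), ("chr3", 750, "T", "C")]
print(classify_variants(normal, tumour))
```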

Cancer genomics studies identify both germline and somatic changes

Aim: to identify the changes that only occur in the tumour cells

[Figure: Normal/Germline DNA and Tumour DNA tracks, with variants labelled Germline, Somatic, Somatic deletion, Somatic amplification and Somatic translocation]

For tumour data low frequency variants may be clinically relevant

Ovarian cancer patient with a deleterious germline BRCA2 deletion

[Figure labels: BRCA2 exon 9; BRCA2 exon 10; Normal control sample; Metastasis 1; Metastasis 2; 5 bp germline deletion; High quality 3 bp somatic deletion]

Six deletion reversion mutations identified within BRCA2 from a single rapid autopsy case

[Figure labels: BRCA2 exon 9; BRCA2 exon 10; Normal control sample; Metastasis 1; Metastasis 2; 5 bp germline deletion; frequency reversion deletions; Evidence of 2 exon deletion; High frequency reversion deletion]

Different reversion deletions could be identified in differing proportions at multiple metastasis sites

Events 1-3 and 9 are found in many abdominal deposits whereas 5, 7 and 14 are only identified in

Patch et al Nature 2015

Large genomic structural variants need different detection strategies

[Figure labels: Sample; Reference]

Ovarian cancer genomes have high instability and are highly rearranged

Structural variants underlie copy number changes

Spectral karyotype from HGS OvCa cell line, Ouellet et al 2008 BMC Cancer

There are 4 main methods for SV detection in WG sequencing (read pair, read depth, split read and assembly)

[Figure labels: Deletion; Duplication/Insertion; Translocation; Low resolution]

Alkan, Coe and Eichler 2011

Some tools only use one detection method but there are multi-method tools now available

Visualising structural variants

Submicroscopic homozygous deletion in a tumour sample

Robinson et al 2011

Chromosome 13: 1.3 kb somatic deletion including exon 17 of RB1 gene

[Figure tracks: Tumour; Germline]

Discordantly mapped read pairs mark rearrangements

Detection tools identify clusters of read pairs with similar characteristics e.g. BreakDancer (Chen et al 2009)

•Read pairs too far apart
•Read pairs too close together
•Read pairs in wrong orientation

[Figure labels: reference; >1.3kb insert size]
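The three discordant classes above can be sketched as a small classifier for a single read pair. This is an illustration only and not how BreakDancer is implemented: the function name and the insert size limits of 100 and 500 bp are hypothetical, and the expected orientation is assumed to be the standard forward/reverse layout of Illumina paired-end libraries. Real tools then cluster pairs that share a class and location.

```python
def classify_pair(chrom: str, pos: int, is_reverse: bool,
                  mate_chrom: str, mate_pos: int, mate_is_reverse: bool,
                  tlen: int, min_insert: int = 100, max_insert: int = 500) -> str:
    """Label one read pair as concordant or as one of the discordant classes."""
    if chrom != mate_chrom:
        return "different_chromosome"
    # Expect the leftmost read on the forward strand and the rightmost on the reverse
    if pos <= mate_pos:
        left_reverse, right_reverse = is_reverse, mate_is_reverse
    else:
        left_reverse, right_reverse = mate_is_reverse, is_reverse
    if left_reverse or not right_reverse:
        return "wrong_orientation"
    insert_size = abs(tlen)
    if insert_size > max_insert:
        return "too_far_apart"       # e.g. the pair spans a deletion
    if insert_size < min_insert:
        return "too_close_together"  # e.g. the pair spans an insertion
    return "concordant"


# Invented coordinates for illustration
print(classify_pair("chr13", 1000, False, "chr13", 2450, True, 1600))
```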

Typical aligned read-pair insert size distribution visualised by qProfiler

[Figure: DNA fragment size distribution; x-axis: base pairs; 300bp median; normally mapped pairs]

Insert size estimation is key for detection with discordantly mapped read pairs

Insert size depends on DNA fragmentation step (~300 bp)

[Figure labels: Paired-end reads; Aligned pairs insert size]
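One simple way to derive the limits for "normally mapped" pairs is to take the median insert size and allow a number of robust standard deviations either side. This is a minimal sketch under assumptions: the use of the median absolute deviation, the factor of 3 and the function name are choices made here for illustration, not the method used by qProfiler or any specific SV caller.

```python
from statistics import median


def insert_size_bounds(insert_sizes, n_deviations: float = 3.0) -> tuple:
    """Return (lower, upper) insert size limits from the observed distribution."""
    sizes = [abs(size) for size in insert_sizes]
    centre = median(sizes)
    mad = median(abs(size - centre) for size in sizes)  # median absolute deviation
    spread = 1.4826 * mad  # scales MAD to a standard deviation for normal data
    return centre - n_deviations * spread, centre + n_deviations * spread


# Invented insert sizes around a 300 bp median
low, high = insert_size_bounds([280, 290, 300, 310, 320])
print(low, high)
```

Pairs with an insert size outside these limits are candidates for the "too close together" or "too far apart" classes; a median-based estimate is used because the discordant pairs themselves would distort a mean and standard deviation.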

Changes in coverage support rearrangements

Clear drop in coverage over the region in the tumour sample

[Figure tracks: Tumour; Germline]

Tools are available that identify copy number variants from read depth partitioning and GC content and mappability correction plus allele frequency analysis

Genomic position

Titan (Ha et al 2014 Genome Res)

Changes in coverage can be interpreted as copy number and can mark rearrangement breakpoints

Fewer reads mapped
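The idea of reading copy number from coverage can be sketched as a per-bin log2 ratio of tumour to germline read counts. This is a bare-bones illustration, not the Titan method: it only normalises for total read count, it applies no GC content or mappability correction, and the function name and example counts are hypothetical.

```python
from math import log2


def log2_depth_ratios(tumour_counts, normal_counts) -> list:
    """Log2 ratio of library-size-normalised read counts for each genomic bin."""
    if len(tumour_counts) != len(normal_counts):
        raise ValueError("both samples need the same number of bins")
    tumour_total = sum(tumour_counts)
    normal_total = sum(normal_counts)
    ratios = []
    for tumour, normal in zip(tumour_counts, normal_counts):
        if tumour == 0 or normal == 0:
            ratios.append(None)  # ratio undefined without reads in both samples
            continue
        ratios.append(log2((tumour / tumour_total) / (normal / normal_total)))
    return ratios


# Invented read counts: the third bin has half the tumour reads of its neighbours
print(log2_depth_ratios([100, 100, 50, 100], [100, 100, 100, 100]))
```

A bin whose ratio sits about one log2 unit below its neighbours has roughly half the relative coverage, and the positions where the ratio changes are candidate rearrangement breakpoints.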
