Structural variants and mutation detection using whole genome sequencing 2014 Winter School in Mathematical and Computational Biology AnnMarie Patch

Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing


DESCRIPTION

As part of the International Cancer Genome Consortium, the Queensland Centre for Medical Genomics (QCMG) has established a world-class laboratory and computational infrastructure, balanced with high-level expertise, to enable the analysis of whole human genomes for the presence of DNA, RNA and epigenetic variants that are associated with the hallmarks of cancer. This talk will describe and discuss the principles and challenges of identifying structural variants (SVs) using whole genome sequencing. I will present the basics of SV detection, a tool developed at the QCMG, and examples of how SV analysis can identify mechanisms driving tumorigenesis. First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/


Page 1: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Structural variants and mutation detection using whole genome sequencing 

2014 Winter School in Mathematical and Computational Biology

Ann‐Marie Patch

Page 2: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Mutation Detection success depends on previous steps

Generalised sequencing workflow

Sample preparation > Library preparation > Sequence generation > Initial data processing > Data analysis

Sample preparation
•DNA
•RNA
•miRNA
•BAC

Library preparation
•Fragmentation
•Size selection
•Target enrichment
•Indexing

Sequence generation
•Platform
•Sequence length

Initial data processing
•Base calling
•Quality assessment

Data analysis
•De novo assembly
•Alignment
•Mutation detection
•Annotation
•Biological interpretation

Page 3: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Mutation cataloguing aids understanding of cancer genetics

International Cancer Genome Consortium projects

sequenced >1500 samples from >600 patients

Page 4: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

DNA mutations are normally sensed and repaired or the cell dies

DNA damage

Normal cell division

DNA damage is sensed by the cell

Cell suicide or apoptosis

Mechanisms for:
•DNA repair
•Cell cycle check points
•Signalling for cell death
•Signalling for growth

Page 5: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Cancer cells accumulate mutations

DNA damage

Tumour cell division

DNA damage is NOT sensed by the cell

Cell suicide or apoptosis

Disrupted mechanisms for:
•DNA repair
•Cell cycle check points
•Signalling for cell death
•Signalling for growth

First mutation

Second mutation

Third mutation

Fourth or later mutation

Uncontrolled growth

Page 6: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Cancer sequencing basics

Typically our projects involve the parallel analysis of at least two samples for each patient

Inherited genome sample (germline)
•Germline variants: seen in both samples

Tumour genome sample
•Somatic mutations: specific to tumour sample

Data analysis
•Mutation detection
•Annotation
•Biological interpretation

The tumour sample contains a mixed population of normal and tumour cells, and also subpopulations of different tumour cells

Page 7: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Mutation detection – finding differences

DNA Mutation detection

•SNV/SNP/Substitutions

•Small insertions and deletions

•Large structural variations

•Copy number aberrations

Cloonan et al 2011

Large structural variants are only detectable from whole genome sequencing

Page 8: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Whole genome paired‐end sequencing process recap ‐ library preparation

Genomic DNA

Fragmented DNA 

Page 9: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Whole genome paired‐end sequencing process recap ‐ library preparation

Genomic DNA

Fragmented DNA

Clean‐up DNA fragments

Consistent fragment size distribution

Page 10: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Whole genome paired‐end sequencing process recap ‐ library preparation

Clean‐up DNA fragments

Adaptors added

Sequence reads produced from both ends of each fragment

The distance between the outer ends of the two reads in a pair should follow the DNA fragment size distribution (~300 bp)

Page 11: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Paired‐end sequence alignment to the reference genome

Reference genome

Paired‐end sequences mapped to genome

Coverage depth

Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine mutations and structural rearrangements

Page 12: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Detection software pinpoints differences in your sample from the reference

[Figure: read pairs aligned to the reference genome, with differences marked by asterisks]

Normal/Germline DNA: germline SNV

Tumour DNA: somatic SNV, somatic deletion, somatic translocation, somatic amplification

We convert mutation data into positional information and counts using detection software

Somatic mutations that only occur in the tumour are determined

Page 13: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Choosing what software to use to identify mutations

Software list: http://seqanswers.com/wiki/Software/list

Choice can be guided by:
•Type of data
•The biological question
•Available computing resources
•Past experience
•Related literature

QCMG DNA mutation detection

•Substitutions
  qSNP – in‐house tool
  GATK – Broad

•Small insertions and deletions
  Pindel – Sanger
  GATK – Broad

•Large structural variations
  qSV – in‐house tool

Page 14: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Visualising a germline single nucleotide variant example

Paired‐end HiSeq data for Ovarian Cancer patient, Chromosome 11

[Figure: read alignment view with tracks for tumour data, normal data and the reference sequence. Robinson et al 2011]

Grey blocks: 100bp reads matching the reference

Small coloured blocks indicate a change from the reference

The reference base is a G

There is an A present in some of the reads

Page 15: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Pileup analysis produces counts of alleles

Coverage (e.g. 56x): count of non‐duplicate reads that cross any given position

Allele frequency: count of bases at any position

Tumour data: G=36 A=20 (Total coverage 56x)

Normal data: G=26 A=33 T=1 (Total coverage 60x)
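The counts on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes the bases covering one position have already been extracted from non-duplicate reads; the function names are illustrative and are not taken from qSNP or any other tool mentioned in the talk.

```python
from collections import Counter


def allele_counts(bases):
    """Count each A/C/G/T base observed at a single reference position."""
    return Counter(b.upper() for b in bases if b.upper() in "ACGT")


def allele_proportions(counts):
    """Convert base counts into proportions of the total coverage."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}


# Pileup columns rebuilt from the counts quoted on the slide.
tumour = allele_counts("G" * 36 + "A" * 20)
normal = allele_counts("G" * 26 + "A" * 33 + "T")

for name, counts in (("Tumour", tumour), ("Normal", normal)):
    coverage = sum(counts.values())
    percentages = {b: round(100 * p) for b, p in allele_proportions(counts).items()}
    print(name, f"coverage={coverage}x", percentages)
```

Rounded to whole percentages, the output matches the allele proportions discussed on the next slide.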

Page 16: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Considering error and bias

Allele proportions

Sample   Coverage   Reference allele %   Alternate allele %   Other allele %   Bi‐allelic ratio
Tumour   56         G=64%                A=36%                ‐                1:0.56
Normal   60         G=43%                A=55%                T=2%             1:1.3

Diploid organism: expected bi‐allelic proportion 50% (ratio 1:1)

Highly skewed representation in tumour samples

Changes in expected proportions can be due to:
•Sample contamination/integrity
•Stochastic sampling/low coverage depth
•Capture or enrichment bias
•Alignment/mapping strategy
•Sequencing error

How should we determine a good call from error?
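One simple way to approach this question is to ask how surprising the observed allele counts would be if the true proportion were 50%. The sketch below uses an exact two-sided binomial test written with the standard library only. It is an illustration and not the method used by qSNP or GATK, and it ignores base qualities, mapping biases and tumour purity.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k alternate reads out of n.

    Sums the probability of every outcome that is no more likely than
    the observed one.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float rounding
    return min(1.0, sum(x for x in pmf if x <= threshold))


# Counts from the pileup slide; the single T in the normal is left out.
p_tumour = binomial_two_sided_p(20, 36 + 20)  # A reads out of G + A
p_normal = binomial_two_sided_p(33, 26 + 33)

print(f"tumour p={p_tumour:.3f}  normal p={p_normal:.3f}")
```

A small p-value says only that the proportions depart from 1:1. Deciding whether that reflects copy number change, contamination or error still needs the other evidence listed above.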

Page 17: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

How many SNVs would we expect to find?

Human genome (length ~ 3,000,000,000 bases)

Germline changes  = ~ 3,000,000 (~1000 mutations per Mb (0.1%))

Ovarian Cancer genome

Somatic mutation  = ~6,000 (~2 mutations per Mb)

This number can vary greatly depending upon the type of cancer being sequenced
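The expected counts follow directly from the per-megabase rates quoted on this slide. A short worked calculation, in which only the function name is an invention:

```python
GENOME_LENGTH_BP = 3_000_000_000  # approximate human genome length from the slide


def expected_variants(rate_per_mb, genome_bp=GENOME_LENGTH_BP):
    """Expected number of variants for a given rate per megabase."""
    return rate_per_mb * genome_bp / 1_000_000


germline = expected_variants(1000)  # ~1000 per Mb, i.e. 0.1% of bases
somatic = expected_variants(2)      # ~2 per Mb in the ovarian cancer example

print(f"germline ~{germline:,.0f}, somatic ~{somatic:,.0f}")
```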

Page 18: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Filtering of results from mutation detection tools is necessary

Example for sample purity = 64%

Tool   Raw somatic   Filtered somatic   Raw germline   Filtered germline
qSNP   298,388       6,632              4,180,630      3,698,034
GATK   224,839       9,722              4,945,990      4,069,314

Somatic: keep between 2‐4%
Germline: keep between 84‐88%

Remember the expected number of somatic mutations is ~6,000 and of germline variants ~3,000,000

qSNP: in‐house, sensitive, rules‐based heuristic tool (Kassahn et al 2013)
GATK (unified genotyper): a Bayesian tool (McKenna et al 2010)

The intersect of these tools produces a high confidence SNV call
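Taking the intersect of two call sets can be sketched with Python sets once each call is reduced to a comparable key. The key layout (chromosome, position, reference base, alternate base) and the toy calls below are assumptions for illustration; they are not real qSNP or GATK output.

```python
def high_confidence(calls_a, calls_b):
    """Return calls reported by both tools, sorted by chromosome and position."""
    return sorted(set(calls_a) & set(calls_b))


# Hypothetical filtered calls: (chromosome, position, reference, alternate)
qsnp_calls = [
    ("chr11", 1_000_100, "G", "A"),
    ("chr13", 2_500_000, "T", "C"),
    ("chr17", 7_000_000, "C", "T"),
]
gatk_calls = [
    ("chr11", 1_000_100, "G", "A"),
    ("chr17", 7_000_000, "C", "T"),
    ("chr19", 9_999_999, "A", "G"),
]

consensus = high_confidence(qsnp_calls, gatk_calls)
print(len(consensus), "calls made by both tools")
```

Calls made by only one tool are not necessarily wrong. They are candidates for the manual review and verification steps described later in the talk.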

Page 19: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

QCMG Strategy for identifying somatic substitution mutations

Control of quality of variant calls through input filtering:
•mapping quality for reads >10
•maximum number of mismatches in read <=3
•minimum consecutive matched bases in a read >=34
•duplicate reads removed

Somatic variant calls are made when the following thresholds are met:
•minimum number of reads with the variant
•minimum coverage in tumour and normal sample
•maximum variant count for a given coverage in the matched normal
•threshold proportion of variant call qualities at that position

Potential weakness in calls annotated:
•Variant seen in unfiltered bam of matched normal
•Position of variant within 5 bp of ends of reads
•Variant not seen in sequencing reads of both directions
•Variant seen in germline of another patient
•Number of novel starts for reads supporting variant is low

[Figure: somatic variant in tumour data and normal data]

Somatic variant: Tumour T=63% C=37%, Normal T=100%
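The input filtering thresholds on this slide can be expressed as a single predicate over a read. This is a sketch only: the `Read` fields are hypothetical summaries that would have to be derived from an alignment record, and the code is not taken from qSNP.

```python
from dataclasses import dataclass


@dataclass
class Read:
    mapping_quality: int    # aligner-reported mapping quality
    mismatches: int         # number of mismatched bases in the read
    longest_match_run: int  # longest run of consecutive matched bases
    is_duplicate: bool      # flagged as a PCR or optical duplicate


def passes_input_filter(read):
    """Apply the four input filters listed on the slide."""
    return (
        read.mapping_quality > 10
        and read.mismatches <= 3
        and read.longest_match_run >= 34
        and not read.is_duplicate
    )


reads = [
    Read(60, 1, 80, False),  # kept
    Read(5, 0, 100, False),  # low mapping quality
    Read(60, 4, 50, False),  # too many mismatches
    Read(60, 2, 30, False),  # matched run too short
    Read(60, 0, 100, True),  # duplicate
]
kept = [r for r in reads if passes_input_filter(r)]
print(f"{len(kept)} of {len(reads)} reads pass the input filter")
```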

Page 20: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Detection – examination – verification – modify

We have used a cyclical feedback approach to inform the filtering strategy and improve our mutation calling

Detect mutations

Examine: manual IGV review

Independent verification:
•PCR and capillary sequencing
•PCR and deep MiSeq sequencing
•SOLiD sequencing
•mRNA sequencing

Identify patterns and modify filtering strategies

This approach has been key for the detection of small insertions and deletions, as sequencing errors and alignment biases are often exaggerated for indels

Page 21: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Large genomic structural variants need different detection strategies

Ovarian cancer genomes have high instability and are highly rearranged

Structural variants underlie copy number changes

Spectral Karyotype from HGSOvCa Cell line Ouellet et al 2008 BMC Cancer

Deletion Duplication/Insertion Translocation

Low resolution

Reference

Sample

Page 22: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

There are 4 main methods for SV detection in WG sequencing: read pair, read depth, split read and assembly

Alkan, Coe and Eichler 2011

Most well known tools only use one detection method; a few multi‐method tools are now available

Page 23: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Visualising structural variants

Sub microscopic homozygous deletion in a tumour sample

Tumour

Normal

Chromosome 13: 1.3Kb somatic deletion including exon 17 of RB1 gene

Robinson et al 2011

Page 24: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Insert size estimation is key for detection with discordantly mapped read pairs

DNA fragment size distribution: 300bp median

Production of sequence reads from the end of the fragments (~300 bp)

Alignment of read pairs allows calculation of insert size

Typical read‐pair insert size distribution visualised by qProfiler

[Plot: log count against insert size in base pairs for normally mapped reads]
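To flag a read pair as discordant, a tool first needs a range of insert sizes it considers normal. The sketch below estimates that range from the median and the median absolute deviation (MAD) of the observed insert sizes. The use of MAD and the cut-off of 5 MADs are assumptions chosen for illustration, not the rule used by qSV or qProfiler.

```python
from statistics import median


def insert_size_bounds(insert_sizes, n_mads=5):
    """Return (lower, median, upper) bounds for normally mapped pairs.

    The median and MAD are used because they are robust to the few very
    large insert sizes produced by rearrangements or mis-mapping.
    """
    med = median(insert_sizes)
    mad = median([abs(size - med) for size in insert_sizes])
    return med - n_mads * mad, med, med + n_mads * mad


# Toy insert sizes in bp; the 1300 bp pair mimics a pair spanning a deletion.
sizes = [280, 290, 300, 310, 320, 1300]
lower, centre, upper = insert_size_bounds(sizes)
print(f"median={centre} bp, normal range {lower} to {upper} bp")
```

The outlying 1300 bp pair barely shifts the estimate, which is the reason for preferring robust statistics over the mean and standard deviation here.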

Page 25: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Discordantly mapped read pairs mark rearrangements

[Figure: read pairs from tumour and normal samples aligned to the reference; >1.3kb insert size]

Discordant read pair types:
•Read pairs too far apart
•Read pairs too close together
•Read pairs in wrong orientation

Detection tools identify clusters of read pairs with similar characteristics; large clusters indicate more evidence
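The three discordant categories and the clustering step can be sketched as follows. The `Pair` fields, the forward/reverse orientation expected for standard paired-end libraries, the insert-size bounds and the 500 bp clustering gap are all assumptions made for this example rather than details of any particular detection tool.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    chrom: str
    left_pos: int        # leftmost mapped position of the pair
    right_pos: int       # rightmost mapped position of the pair
    left_forward: bool   # True if the left read maps to the forward strand
    right_forward: bool  # True if the right read maps to the forward strand


def classify(pair, lower, upper):
    """Label a pair as concordant or as one of three discordant types."""
    if not (pair.left_forward and not pair.right_forward):
        return "wrong_orientation"
    size = pair.right_pos - pair.left_pos
    if size > upper:
        return "too_far_apart"
    if size < lower:
        return "too_close"
    return "concordant"


def cluster(pairs, max_gap=500):
    """Group pairs of one discordant type that start near each other."""
    clusters = []
    for pair in sorted(pairs, key=lambda p: (p.chrom, p.left_pos)):
        last = clusters[-1][-1] if clusters else None
        if last and last.chrom == pair.chrom and pair.left_pos - last.left_pos <= max_gap:
            clusters[-1].append(pair)
        else:
            clusters.append([pair])
    return clusters


pairs = [
    Pair("chr13", 1000, 1300, True, False),   # concordant
    Pair("chr13", 5000, 6600, True, False),   # spans a deletion
    Pair("chr13", 5050, 6650, True, False),   # supports the same deletion
    Pair("chr13", 90000, 91600, True, False), # isolated discordant pair
]
far = [p for p in pairs if classify(p, 230, 380) == "too_far_apart"]
print([len(c) for c in cluster(far)])
```

Cluster size then acts as a crude evidence score: the two-pair cluster is better supported than the singleton.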

Page 26: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Changes in coverage support rearrangements

Clear drop in coverage over the region in the tumour sample

Tumour

Normal

Page 27: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Coverage changes are often associated with SVs

Changes in coverage can be interpreted as copy number and can mark rearrangement breakpoints

Deletion: fewer reads mapped

Duplication: more reads mapped

[Plot: copy number against genomic position. CNVnator (Abyzov et al 2011)]

Tools are available that identify copy number variants from read depth partitioning and GC content correction
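The idea of reading copy number from coverage can be illustrated by comparing per-window read counts between tumour and normal. This sketch normalises each sample by its median window depth and labels windows by log2 ratio. It is deliberately simplified: there is no GC correction or segmentation, the thresholds are hypothetical, and it is not the CNVnator algorithm.

```python
from math import log2
from statistics import median


def log2_depth_ratios(tumour_counts, normal_counts, eps=0.01):
    """Per-window log2(tumour/normal) after median normalisation."""
    t_med = median(tumour_counts)
    n_med = median(normal_counts)
    if t_med == 0 or n_med == 0:
        raise ValueError("median window depth is zero; coverage too low")
    return [
        log2((t / t_med + eps) / (n / n_med + eps))
        for t, n in zip(tumour_counts, normal_counts)
    ]


def call_windows(ratios, loss_below=-0.5, gain_above=0.4):
    """Label each window as loss, gain or neutral (hypothetical thresholds)."""
    return [
        "loss" if r < loss_below else "gain" if r > gain_above else "neutral"
        for r in ratios
    ]


# Toy read counts per window: the tumour has a drop in windows 4 and 5.
normal = [100] * 10
tumour = [100, 100, 100, 100, 10, 10, 100, 100, 100, 100]
calls = call_windows(log2_depth_ratios(tumour, normal))
print(calls)
```

The boundaries between differently labelled windows are where rearrangement breakpoints would be expected, at window-size resolution only.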

Page 28: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Clusters of soft clipping indicate rearrangement break points

Alignment software that performs soft clipping can reveal exact positions of the break points

Further realignment of the clipped sequences produces split reads

Reads with soft clipping and unmapped reads can be assembled into contigs that span break points
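Soft clipping is recorded in the CIGAR string of each alignment (the `S` operation in the SAM format), so candidate break points can be found by collecting the reference positions where clipping starts or ends and keeping positions supported by several reads. The sketch below works on plain `(position, CIGAR)` tuples; the minimum support of 3 reads is an assumption, and this is not the qSV implementation.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def clip_positions(pos, cigar):
    """Reference positions where soft-clipped bases meet aligned bases.

    pos is the 1-based leftmost aligned position, as in the SAM POS field.
    A leading clip reports the first aligned base; a trailing clip reports
    the last aligned base.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    positions = []
    if ops and ops[0][1] == "S":
        positions.append(pos)
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        positions.append(pos + ref_len - 1)
    return positions


def candidate_breakpoints(alignments, min_support=3):
    """Positions where at least min_support reads are soft clipped."""
    support = Counter()
    for pos, cigar in alignments:
        support.update(clip_positions(pos, cigar))
    return sorted(p for p, n in support.items() if n >= min_support)


alignments = [(1000, "60M40S"), (1010, "50M50S"), (1020, "40M60S"), (500, "100M")]
print(candidate_breakpoints(alignments))
```

With this convention, reads clipped on opposite sides of one break point report adjacent positions, so a real tool would merge nearby positions before counting support.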

Page 29: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

qSV : Detecting Somatic Structural Variants

qSV detects 3 types of supporting evidence

Resolves all lines of evidence to identify breakpoints to base pair resolution

Felicity Newell

Page 30: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Automation of SV verification process

Structural variants require verification

PCR amplification over breakpoints followed by sequencing 

Automation of key stages can increase throughput of verification

PCR of tumour and normal DNA

Verified events are circled

Quek et al in press

Page 31: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Characterising tumour genomes by the distribution of SVs

A huge range in the distribution of SVs in ovarian cancer patients

Unstable >300 events

Complex localised events

Chromosomes

Copy number

B allele frequency

SVs

Circos, Krzywinski et al 2009 

Page 32: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Chromothripsis events can be identified

SV break density

SV types and positions

Copy number segmentation

Log R Ratio

B allele frequency

Chromosome 15

Stephens et al 2011

Page 33: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Breakage‐fusion‐bridge amplification can be identified

SV break density

SV types and positions

Copy number segmentation

Log R Ratio

B allele frequency

Chromosome 12

Loss of telomere region

Control sample

Tumor sample

Kinsella and Bafna 2012

Page 34: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Other complex regions with high density of breakpoints

SV break density

Translocations

SV types and positions

Copy number segmentation

Log R Ratio

B allele frequency

Chromosome 19

Page 35: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Associating structural variants with proximal genes

Structural variant break points are annotated with gene features
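Annotating a break point with nearby genes amounts to an interval lookup against a gene model. A minimal sketch, in which the gene names and coordinates are invented placeholders and the optional `flank` parameter is an assumption about what counts as proximal:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int   # 1-based inclusive start of the gene footprint
    end: int     # 1-based inclusive end of the gene footprint
    strand: str  # "+" or "-"


def genes_at(chrom, pos, genes, flank=0):
    """Names of genes whose footprint, extended by flank bp, contains pos."""
    return [
        g.name
        for g in genes
        if g.chrom == chrom and g.start - flank <= pos <= g.end + flank
    ]


# Hypothetical gene model; a linear scan is fine for a sketch, whereas a
# real annotation set would need an interval index.
gene_model = [
    Gene("GENE_A", "chr13", 10_000, 50_000, "+"),
    Gene("GENE_B", "chr13", 80_000, 95_000, "-"),
]

print(genes_at("chr13", 30_000, gene_model))              # inside GENE_A
print(genes_at("chr13", 78_500, gene_model, flank=2000))  # just upstream of GENE_B
```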

Page 36: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Gene model annotation of break points can predict fusion genes

Gene fusions can occur if:
•both breakpoints are within the footprints of genes
•the transcription direction of the two genes align
•translation phase of adjoining exons match
•splicing signals are not disrupted
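The four conditions above can be combined into one check once each side of the junction has been annotated. The `BreakEnd` fields below are hypothetical summaries of that annotation. In particular, `strand` is assumed to be the transcription direction of the gene as it sits on the rearranged chromosome, which a real tool would have to work out from the SV orientation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BreakEnd:
    gene: Optional[str]          # gene containing the break point, None if intergenic
    strand: str                  # transcription direction on the rearranged chromosome
    phase: int                   # codon phase (0, 1 or 2) at the exon boundary next to the junction
    splice_signals_intact: bool  # donor/acceptor sites not disrupted by the break


def may_form_fusion(five_prime, three_prime):
    """True when all four conditions listed on the slide hold."""
    return (
        five_prime.gene is not None
        and three_prime.gene is not None
        and five_prime.strand == three_prime.strand
        and five_prime.phase == three_prime.phase
        and five_prime.splice_signals_intact
        and three_prime.splice_signals_intact
    )


upstream = BreakEnd("GENE_A", "+", 1, True)
in_frame = BreakEnd("GENE_B", "+", 1, True)
out_of_frame = BreakEnd("GENE_B", "+", 2, True)

print(may_form_fusion(upstream, in_frame))      # True
print(may_form_fusion(upstream, out_of_frame))  # False
```

A positive result is only a prediction from the gene model; expression of the fusion transcript would still need to be confirmed, for example by mRNA sequencing.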

Barsha Poudel

Page 37: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Patient summary of mutations identified

chromosomes

Coding small mutations with amino acid change

SNP array track that shows copy number gain in red and loss in green and regions of loss of heterozygosity

Structural variants in centre

Circos, Krzywinski et al 2009

ICGC catalogue of mutations

Page 38: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Mutation detection summary

Output of mutation detection software requires careful filtering

Development of filtering strategy typically requires a feedback process 

Verification is a key part of this process

Detection – examination – verification – modify

Detect mutations

Examine: manual IGV review

Independent verification

Identify patterns and modify filtering strategies

Page 39: Ann Marie-Patch - Structural variants and mutation detection using whole genome sequencing

Acknowledgements: 

Bioinformatics: John Pearson, Felicity Newell, Lynn Fink, Conrad Leonard, Oliver Holmes, Qinying Xu, Matthew Anderson, Stephen Kazakoff, Nick Waddell, Scott Wood

Genome Biology: Sean Grimmond, Nicola Waddell, Katia Nones, Peter Bailey, Michael Quinn, Kelly Quek

Sequencing: David Miller, Angelika Christ, Tim Bruxner, Craig Nourse, Ehsan Nourbakhsh, Suzanne Manning, Ivon Harliwong, Senel Idrisoglu, Shivangi Wani

Previous team members: Karin Kassahn, Barsha Poudel, Sarah Song, Nicole Cloonan, Darrin Taylor, Deborah Gywnne, Peter Wilson, Anita Steptoe

Peter MacCallum Cancer Centre: David Bowtell, Dariush Etemadmoghadam, Elizabeth Christie, Dale Garsed, Joshy George, Sian Fereday, Laura Galletta, Kathryn Alsop, Nadia Traficante, Joy Hendley, Chris Mitchell, Prue Cowin

Westmead Institute for Cancer Research: Anna deFazio, Catherine Kennedy, Yoke-Eng Chiew, Jillian Hung

National Health and Medical Research Council

Australian Government

Clinicians and patients