Mutation detection using whole genome sequencing
2017 Winter School in Mathematical and Computational Biology
Ann-Marie Patch
3rd July 2017
Mutation Detection success depends on previous steps
Generalised sequencing workflow: Sample preparation → Library preparation → Sequence generation → Initial data processing → Data analysis
•DNA
•RNA
•miRNA
•BAC
•Fragmentation
•Size selection
•Target enrichment
•Indexing
•Platform
•Sequence length
•Base calling
•Quality assessment
•De novo assembly
•Alignment
•Mutation detection
•Annotation
•Biological interpretation
Whole genome paired-end sequencing process recap - library preparation
Sample preparation is key to getting good results
High molecular weight DNA is required (not just DNA quantification)
Smears indicate degraded samples
[Figure: genomic DNA with a 10 000bp size marker]
Genomic DNA → Fragment DNA → Clean-up DNA fragments
Consistent fragment size distribution across all your samples
Adaptors added
Sequence reads produced from both ends of each fragment
The distance from the ends of the reads should follow the DNA size distribution
HiSeq ~300 bp
HiSeq XTen ~ 400 bp
FFPE (formalin-fixed, paraffin-embedded) DNA samples have a high degree of fragmentation.
This produces a shorter TLEN (template length), so think about the read length of the sequencing or you could end up paying to sequence the adapters.
[Figure: a fragment with a median length of 150bp between Adapter 1 and Adapter 2, with Read 1 and Read 2 from HiSeq X Ten 2x 150bp sequencing]
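The relationship between fragment length and read length can be sketched with a little arithmetic. The function names below are hypothetical and the model is a simplification that ignores adapter trimming and quality drop-off; it only illustrates why short FFPE fragments waste sequencing on adapters.

```python
def adapter_bases(fragment_len: int, read_len: int) -> int:
    """Bases at the end of each read that run past the fragment into the adapter."""
    return max(0, read_len - fragment_len)


def overlap_bases(fragment_len: int, read_len: int) -> int:
    """Fragment bases covered by both Read 1 and Read 2 (sequenced twice)."""
    covered_twice = 2 * min(read_len, fragment_len) - fragment_len
    return max(0, covered_twice)


if __name__ == "__main__":
    read_len = 150  # 2x 150bp sequencing, as on the slide
    for fragment_len in (100, 150, 300, 400):
        print(
            f"fragment {fragment_len}bp: "
            f"{adapter_bases(fragment_len, read_len)}bp of adapter per read, "
            f"{overlap_bases(fragment_len, read_len)}bp sequenced twice"
        )
```

With a 150bp median fragment and 2x 150bp reads, roughly half of the fragments are shorter than the read length, so both reads of those pairs cover the same bases and then continue into adapter sequence.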
Library production is key for successful mutation detection
Paired-end sequence alignment to a reference genome
[Figure: paired-end sequences mapped to a reference genome, with a read depth track]
Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine variations and structural rearrangements
Detection software pinpoints differences in your sample compared to a reference
[Figure: reads aligned to the reference, marking SNV/indels (*), deletions, amplifications and translocations]
Variants are recorded as positional information and read counts generated by detection software
Software tools are usually designed to detect one type of variant
Choosing mutation calling software
Choice can be guided by
Type of data
The biological question
Available computing resources
Past experience
Related literature
QIMR DNA variant detection
•Substitutions
  - qSNP (QIMR)
  - GATK UnifiedGenotyper (Broad)
•Small insertions and deletions
  - GATK UnifiedGenotyper (Broad)
•Large structural variations
  - qSV (QIMR)
•Copy number aberrations
  - ASCAT-ngs
Identifying software to try
One of many on-line resources
https://omictools.com/whole-genome-resequencing-category
Just remember:
Each piece of software will give different results
Results depend on the quality of the starting material
  - DNA/RNA quality
  - disease or organism type
Evaluate the output from each tool for your data and research question
Visualising the variants detected
Below is a 404bp region from human chromosome 3 (XTen WGS 150bp paired end data; Robinson et al 2011)
[Figure: alignment view showing the reference sequence and a pair of reads. Grey blocks are base pairs matching the reference; small coloured blocks indicate a change from the reference.]
Identifying signal from noise
Below is a 40bp region over exon 14 of BAP1 (Robinson et al 2011)
[Figure: read alignments against the reference sequence]
Signal: consistent variants per position across a number of reads
Noise: random variants per position in only a few reads
Total read count = 37
No of reference (T) = 19
No of alternate (C) = 18
51% T : 49% C
In a diploid organism this equates to a heterozygous call
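The read counts above translate into an allele fraction as in the following minimal sketch. The function names and the 25% and 75% cut-offs are illustrative assumptions, not thresholds taken from qSNP, GATK or any other tool.

```python
def allele_fraction(ref_count: int, alt_count: int) -> float:
    """Fraction of reads at a position that carry the alternate allele."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no reads cover this position")
    return alt_count / total


def naive_diploid_genotype(ref_count: int, alt_count: int,
                           het_low: float = 0.25, het_high: float = 0.75) -> str:
    """Very simple genotype guess from the allele fraction alone.

    het_low and het_high are hypothetical cut-offs chosen for illustration.
    """
    fraction = allele_fraction(ref_count, alt_count)
    if fraction < het_low:
        return "homozygous reference"
    if fraction > het_high:
        return "homozygous alternate"
    return "heterozygous"


if __name__ == "__main__":
    # Counts from the BAP1 example: 19 reference (T) and 18 alternate (C) reads.
    print(f"alternate allele fraction: {allele_fraction(19, 18):.0%}")
    print(naive_diploid_genotype(19, 18))
```

Real detection tools also weigh base quality, mapping quality and strand, so a fraction-only rule like this is a starting point for inspection rather than a caller.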
Ensure coverage is not a limiting factor
In germline sequencing most homozygous SNVs are detected at a 15X average depth, but an average depth of 33X is required to reproducibly detect the same proportion of heterozygous SNVs (Bentley et al. Nature 2008).
Uninformative reads reduce the final coverage and ultimately the mutation detection sensitivity (Sims et al Nat Rev Genet. 2014).
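A rough model helps explain why heterozygous SNVs need more depth than homozygous ones. The sketch below assumes that depth at a site follows a Poisson distribution around the average, that each read carries the alternate allele with a fixed probability (1.0 for homozygous, 0.5 for heterozygous), and that a call needs at least `min_alt_reads` supporting reads. All of these are simplifying assumptions, so the output will not reproduce the published figures exactly.

```python
from math import comb, exp


def poisson_pmf(k: int, mean: float) -> float:
    """P(depth == k) for a Poisson distribution, built up iteratively."""
    p = exp(-mean)
    for i in range(1, k + 1):
        p *= mean / i
    return p


def binomial_at_least(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def detection_probability(mean_depth: float, alt_fraction: float,
                          min_alt_reads: int = 4, max_depth: int = 200) -> float:
    """Chance that a site has at least min_alt_reads alternate reads.

    min_alt_reads and max_depth are hypothetical settings for this sketch.
    """
    total = 0.0
    for depth in range(min_alt_reads, max_depth + 1):
        total += poisson_pmf(depth, mean_depth) * binomial_at_least(
            min_alt_reads, depth, alt_fraction
        )
    return total


if __name__ == "__main__":
    for mean_depth in (15, 33):
        hom = detection_probability(mean_depth, 1.0)
        het = detection_probability(mean_depth, 0.5)
        print(f"{mean_depth}X: homozygous {hom:.4f}, heterozygous {het:.4f}")
```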
What about when the noise level is high?
Is 22% of the alternate allele enough to make a good call?
Same XTen data as before but a different region
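One way to think about a 22% alternate allele is to ask how surprising it would be under two simple explanations: a true heterozygous site, or random sequencing error. The sketch below uses exact binomial tail probabilities. The depth of 37 with 8 alternate reads and the 1% per-base error rate are hypothetical values picked for illustration; the slide does not give them.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X == k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k: int, n: int, p: float) -> float:
    """P(X <= k): chance of seeing this few alternate reads or fewer."""
    return sum(binomial_pmf(i, n, p) for i in range(0, k + 1))


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k): chance of seeing this many alternate reads or more."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


if __name__ == "__main__":
    depth, alt_reads = 37, 8   # hypothetical: about 22% alternate allele
    error_rate = 0.01          # hypothetical per-base error rate

    p_if_heterozygous = prob_at_most(alt_reads, depth, 0.5)
    p_if_error_only = prob_at_least(alt_reads, depth, error_rate)
    print(f"P(<= {alt_reads} alt reads | heterozygous) = {p_if_heterozygous:.2e}")
    print(f"P(>= {alt_reads} alt reads | error only)   = {p_if_error_only:.2e}")
```

Under these assumptions neither simple explanation fits well, which is why calls like this need a closer look at the supporting reads before they are accepted or rejected.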
Understand the characteristics of the sequencing reads
Detection tools read in the alignment files (BAM files)
Samtools can help you investigate the reads that provide evidence of the variant
SAM fields
1 QNAME Query template/pair NAME
2 FLAG bitwise FLAG
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition/coordinate of clipped sequence
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR extended CIGAR string
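7 RNEXT Reference sequence name of the mate/NEXT read (`=' if same as RNAME; MRNM in older versions of the specification)
8 PNEXT 1-based POSition of the mate/NEXT read (MPOS in older versions of the specification)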
9 TLEN inferred Template LENgth (insert size)
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
12+ OPT variable OPTional fields in the format TAG:VTYPE:VALUE
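The field layout above can be explored with a few lines of code. This is a minimal sketch for a single tab-separated SAM record; the example read is made up, and the field names RNEXT and PNEXT are the current specification names for the mate fields (MRNM and MPOS in older versions). For real BAM files a dedicated library or Samtools is the better choice.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# Meaning of each bit in the bitwise FLAG field.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    columns = line.rstrip("\n").split("\t")
    record = {
        name: int(value) if name in INTEGER_FIELDS else value
        for name, value in zip(SAM_FIELDS, columns)
    }
    record["tags"] = columns[11:]  # optional TAG:VTYPE:VALUE fields
    return record


def decode_flag(flag: int) -> list:
    """List the properties that are set in a bitwise FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def phred_qualities(qual: str) -> list:
    """Convert the QUAL string to Phred scores (ASCII code minus 33)."""
    return [ord(char) - 33 for char in qual]


if __name__ == "__main__":
    # Made-up example record for illustration only.
    example = "read1\t99\tchr3\t1000\t60\t10M\t=\t1200\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
    record = parse_sam_line(example)
    print(record["RNAME"], record["POS"], record["TLEN"])
    print(decode_flag(record["FLAG"]))
    print(phred_qualities(record["QUAL"]))
```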
Mononucleotide runs are particularly error prone
Inconsistent small deletions of 2 to 7 bp at a homopolymer run of A’s
Polymerase slippage during sequencing?
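Flagging calls that fall inside mononucleotide runs is a simple annotation to add. The sketch below finds homopolymer runs in a reference sequence with a regular expression; the function names and the minimum run length of 5 are assumptions made for illustration.

```python
import re


def homopolymer_runs(sequence: str, min_length: int = 5) -> list:
    """Return (start, length, base) for each run of one base at least min_length long.

    Positions are 0-based offsets into the given sequence.
    """
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (match.start(), match.end() - match.start(), match.group(1))
        for match in pattern.finditer(sequence.upper())
    ]


def in_homopolymer(sequence: str, position: int, min_length: int = 5) -> bool:
    """True if the 0-based position lies inside a homopolymer run."""
    return any(
        start <= position < start + length
        for start, length, _base in homopolymer_runs(sequence, min_length)
    )


if __name__ == "__main__":
    reference = "GATTAC" + "A" * 7 + "GC"  # made-up sequence with a run of 7 A's
    print(homopolymer_runs(reference))
    print(in_homopolymer(reference, 8), in_homopolymer(reference, 2))
```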
Sources of bias in sequencing data
Changes in expected proportions can be due to:
• Sample purity/integrity and heterogeneity
• Stochastic sampling/low coverage depth
• Capture or enrichment bias
• Alignment/mapping bias
• Repetitive regions of the reference sequence
• Sequencing error / Platform related artefact
How should we determine a good call from error?
Detection tools often attempt to provide a confidence level for the variant call
qSNP: a sensitive, in-house, rules-based heuristic tool (Kassahn et al 2013)
GATK (UnifiedGenotyper): a Bayesian tool (McKenna et al 2010)
Tool    Raw germline    Filtered germline
qSNP    4,180,630       3,698,034
GATK    4,945,990       4,069,314
A simplified view states humans are 99.9% identical
We therefore expect ~3,000,000 single nucleotide variants per person (1000 per Mb or 0.1%)
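The expectation and the effect of filtering can be checked with simple arithmetic. The genome size of 3,000 Mb is an assumption implied by the slide's figures (3,000,000 variants at 1000 per Mb); the call counts are the raw and filtered germline numbers reported for qSNP and GATK.

```python
GENOME_SIZE_MB = 3000   # assumed human genome size in megabases
SNVS_PER_MB = 1000      # 0.1% difference between two genomes

EXPECTED_SNVS = GENOME_SIZE_MB * SNVS_PER_MB

# (raw germline calls, filtered germline calls) from the comparison above
CALLS = {
    "qSNP": (4_180_630, 3_698_034),
    "GATK": (4_945_990, 4_069_314),
}


def fraction_removed(raw: int, filtered: int) -> float:
    """Proportion of raw calls discarded by filtering."""
    return (raw - filtered) / raw


if __name__ == "__main__":
    print(f"expected SNVs per person: {EXPECTED_SNVS:,}")
    for tool, (raw, filtered) in CALLS.items():
        print(
            f"{tool}: {fraction_removed(raw, filtered):.1%} of raw calls removed, "
            f"{filtered / EXPECTED_SNVS:.2f}x the expected number remain"
        )
```

Both tools report more calls than the simplified expectation even after filtering, which is consistent with the point that filtering is necessary and that the 99.9% figure is only a rough guide.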
Filtering of results from mutation detection tools is necessary
Strategy for identifying and filtering substitution variants
Quality filter the reads that are used by the detection software
• Remove duplicate reads
• Require a minimum mapping quality for reads e.g. >10
• Impose a maximum number of mismatches allowed in a read e.g. <=3
• Require a minimum number of consecutive matched bases in a read e.g. >=34
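The read-level filters listed above can be sketched as a single function. Reads are represented here as plain dictionaries with hypothetical keys (`flag`, `mapq`, `cigar`, `nm`), and two simplifications are assumed: the NM tag (edit distance, which also counts indel bases) stands in for the mismatch count, and the longest M or = run in the CIGAR string stands in for consecutive matched bases, even though M can include mismatches.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for PCR or optical duplicates


def longest_aligned_run(cigar: str) -> int:
    """Length of the longest uninterrupted M or = block in a CIGAR string."""
    runs = [int(length) for length, op in CIGAR_OPERATION.findall(cigar) if op in "M="]
    return max(runs, default=0)


def passes_read_filters(read: dict, mapq_above: int = 10,
                        max_mismatches: int = 3, min_run: int = 34) -> bool:
    """Apply the example thresholds: MAPQ >10, <=3 mismatches, >=34 matched bases."""
    if read["flag"] & DUPLICATE_FLAG:
        return False
    if read["mapq"] <= mapq_above:
        return False
    if read["nm"] > max_mismatches:
        return False
    return longest_aligned_run(read["cigar"]) >= min_run


if __name__ == "__main__":
    # Made-up reads for illustration only.
    reads = [
        {"flag": 99, "mapq": 60, "cigar": "150M", "nm": 1},
        {"flag": 1123, "mapq": 60, "cigar": "150M", "nm": 0},     # duplicate
        {"flag": 99, "mapq": 5, "cigar": "150M", "nm": 0},        # low mapping quality
        {"flag": 99, "mapq": 60, "cigar": "20M5D30M", "nm": 5},   # too many differences
    ]
    print([passes_read_filters(read) for read in reads])
```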
Understand the characteristics that influence the confidence of a variant call
• What is the minimum number/proportion of variant-containing reads required?
• Is there a minimum read depth for a good call?
• Are the base qualities for variant positions taken into account?
Look for potential weakness in calls by adding your own annotation
• Position of the variant within the reads: are they all at the ends of reads?
• Is the variant identified in reads sequenced in both directions?
• Is the variant identified in the majority of your samples, so could it be an artefact?
• Is the variant in a repetitive area of the reference?
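Two of the annotations above, strand support and position within the read, are easy to compute once the supporting reads are collected. The sketch below assumes each supporting read is a dictionary with hypothetical keys `offset` (0-based position of the variant in the read), `length` and `reverse`; the 5-base end window is also an assumption.

```python
def annotate_support(supporting_reads: list, end_window: int = 5) -> dict:
    """Summarise strand and read-position evidence for one variant call."""
    forward = sum(1 for read in supporting_reads if not read["reverse"])
    reverse = len(supporting_reads) - forward
    near_end = sum(
        1
        for read in supporting_reads
        # distance from the variant to the nearest end of the read
        if min(read["offset"], read["length"] - 1 - read["offset"]) < end_window
    )
    return {
        "forward": forward,
        "reverse": reverse,
        "both_strands": forward > 0 and reverse > 0,
        "all_near_read_ends": bool(supporting_reads) and near_end == len(supporting_reads),
    }


if __name__ == "__main__":
    # Made-up supporting reads: all forward strand, all within 5 bases of a read end.
    suspicious = [
        {"offset": 2, "length": 150, "reverse": False},
        {"offset": 148, "length": 150, "reverse": False},
    ]
    print(annotate_support(suspicious))
```

A call supported on only one strand, or only near read ends, is not necessarily wrong, but these flags make it easy to prioritise calls for manual review.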
Benchmark your processes and verify your findings
Download an appropriate published dataset and compare your output with what was published.
For human germline sequencing there are standard datasets
• Genome in a Bottle (National Institute of Standards and Technology)
• Platinum Genomes from a CEPH family (Illumina)
• For cancer COLO-829 is often used
Verification
Use a different technology or source material to test a selection of your variant calls
•PCR and capillary sequencing
•PCR and deep MiSeq sequencing
•SOLiD sequencing
•mRNA sequencing
[Figure: detect mutations; examine (manual IGV review); identify patterns and modify filtering strategies]
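Comparing a call set with a published truth set reduces to counting shared and unshared variants. The sketch below assumes variants are represented as (chromosome, position, reference allele, alternate allele) tuples that have already been normalised to the same representation; the example variants are made up.

```python
def benchmark(called: set, truth: set) -> dict:
    """Compare called variants with a truth set of the same representation."""
    true_positives = len(called & truth)
    false_positives = len(called - truth)
    false_negatives = len(truth - called)
    precision = (
        true_positives / (true_positives + false_positives)
        if called else 0.0
    )
    recall = (
        true_positives / (true_positives + false_negatives)
        if truth else 0.0
    )
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    # Made-up variants for illustration only.
    truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A")}
    called = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "A"), ("chr3", 10, "T", "C")}
    print(benchmark(called, truth))
```

Running the same comparison before and after each change to the filtering strategy shows whether the change removed false positives without losing true calls.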
Mutation cataloguing aids understanding of cancer genetics
Our Projects
Ovarian
Pancreatic
Melanoma
International Cancer Genome Consortium projects
ICGC patient summary of mutations identified
[Figure: Circos plot (Krzywinski et al 2009). Tracks: chromosomes; SNP array track showing copy number gain in red, loss in green, and regions of loss of heterozygosity; structural variants in the centre; coding small mutations with amino acid change.]
ICGC data portal: https://dcc.icgc.org/
Cancer sample sequencing involves the parallel analysis of at least two samples for each patient
Germline: inherited genome sample
Tumour: tumour genome sample
• This sample contains a mixed population of normal and tumour cells
• Also subpopulations of different tumour cells
Data analysis
•Mutation detection
•Annotation
•Biological interpretation
Germline variants: seen in both samples
Somatic mutations: specific to tumour sample
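The germline versus somatic distinction above is, at its simplest, a set comparison between the two samples. The sketch below assumes variants are hashable tuples of (chromosome, position, reference allele, alternate allele) and uses made-up examples. Real somatic callers do much more, for example checking that the normal sample has enough coverage to rule a variant out.

```python
def split_germline_somatic(normal_variants: set, tumour_variants: set) -> dict:
    """Separate variants by which sample(s) they were detected in."""
    return {
        "germline": tumour_variants & normal_variants,     # seen in both samples
        "somatic": tumour_variants - normal_variants,      # specific to the tumour
        "normal_only": normal_variants - tumour_variants,  # kept aside for review
    }


if __name__ == "__main__":
    # Made-up variants for illustration only.
    normal = {("chr13", 1000, "A", "G"), ("chr17", 500, "C", "T")}
    tumour = {("chr13", 1000, "A", "G"), ("chr3", 42, "T", "C")}
    for category, variants in split_germline_somatic(normal, tumour).items():
        print(category, sorted(variants))
```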
Cancer genomics studies identify both germline and somatic changes
Aim: to identify the changes that only occur in the tumour cells
[Figure: normal/germline DNA and tumour DNA aligned to the reference, marking germline SNVs and somatic SNVs (*), a somatic deletion, a somatic amplification and a somatic translocation]
For tumour data low frequency variants may be clinically relevant
Ovarian cancer patient with a deleterious germline BRCA2 deletion
[Figure: reads over BRCA2 exon 9 and exon 10 in the normal control sample, Metastasis 1 and Metastasis 2, showing the 5 bp germline deletion and a high quality 3 bp somatic deletion]
Six deletion reversion mutations identified within BRCA2 from a single rapid autopsy case
[Figure: reads over BRCA2 exon 9 and exon 10 in the normal control sample, Metastasis 1 and Metastasis 2, showing the 5 bp germline deletion, low frequency reversion deletions, a high frequency reversion deletion and evidence of a 2 exon deletion]
Different reversion deletions could be identified in differing proportions at multiple metastasis sites
Events 1-3 and 9 are found in many abdominal deposits, whereas 5, 7 and 14 are only identified in one
Patch et al Nature 2015
Large genomic structural variants need different detection strategies
Ovarian cancer genomes have high instability and are highly rearranged
Structural variants underlie copy number changes
[Figure: spectral karyotype from HGS OvCa cell line (Ouellet et al 2008 BMC Cancer), a low resolution view; sample versus reference diagrams of deletion, duplication/insertion and translocation]
There are 4 main methods for SV detection in WG sequencing: read pair, read depth, split read and assembly (Alkan, Coe and Eichler 2011)
Some tools only use one detection method but there are multi-method tools now available
Visualising structural variants
Sub microscopic homozygous deletion in a tumour sample
Chromosome 13: 1.3Kb somatic deletion including exon 17 of RB1 gene
[Figure: tumour and germline read alignments over the region (Robinson et al 2011)]
Discordantly mapped read pairs mark rearrangements
Detection tools identify clusters of read pairs with similar characteristics e.g. BreakDancer (Chen et al 2009)
• Read pairs too far apart (>1.3kb insert size)
• Read pairs too close together
• Read pairs in wrong orientation
Typical aligned read-pair insert size distribution visualised by qProfiler
[Figure: DNA fragment size distribution; x-axis base pairs, y-axis log count; 300bp median for normally mapped reads]
Insert size estimation is key for detection with discordantly mapped read pairs
Insert size depends on DNA fragmentation step
[Figure: paired-end reads and the insert size of the aligned pair, ~300 bp]
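The link between insert size estimation and discordant pair detection can be sketched in two steps: estimate the normal insert size range from the data, then label pairs that fall outside it or have an unexpected orientation. The median plus or minus 5 median absolute deviations used here is an assumed cut-off for illustration, not the rule used by BreakDancer or qSV, and the expected orientation is taken to be forward for the leftmost read and reverse for the rightmost.

```python
from statistics import median


def insert_size_limits(insert_sizes: list, n_mads: float = 5.0) -> tuple:
    """Estimate the normal insert size range as median +/- n_mads * MAD."""
    centre = median(insert_sizes)
    mad = median([abs(size - centre) for size in insert_sizes])
    return centre - n_mads * mad, centre + n_mads * mad


def classify_pair(insert_size: int, left_reverse: bool, right_reverse: bool,
                  low: float, high: float) -> str:
    """Label a read pair mapped to one chromosome as normal or discordant."""
    # Expected paired-end orientation: leftmost read forward, rightmost read reverse.
    if left_reverse or not right_reverse:
        return "wrong orientation"
    if insert_size > high:
        return "too far apart"
    if insert_size < low:
        return "too close together"
    return "normal"


if __name__ == "__main__":
    # Made-up insert sizes centred on a 300bp median.
    observed = [280, 290, 300, 310, 320, 295, 305]
    low, high = insert_size_limits(observed)
    print(f"normal range: {low} to {high}")
    print(classify_pair(1300, False, True, low, high))
    print(classify_pair(120, False, True, low, high))
    print(classify_pair(300, False, False, low, high))
```

Because the limits come from the fragment size distribution of each library, a poorly controlled fragmentation step widens the normal range and hides smaller rearrangements.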
Changes in coverage support rearrangements
Clear drop in coverage over the region in the tumour sample
[Figure: tumour and germline coverage over the region]
Tools are available that identify copy number variants from read depth partitioning, GC content and mappability correction, plus allele frequency analysis, e.g. Titan (Ha et al 2014 Genome Res)
Changes in coverage can be interpreted as copy number and can mark rearrangement breakpoints
[Figure: CN by coverage and allele frequency plotted against genomic position, with regions of fewer reads mapped (deletion) and more reads mapped]
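The idea of reading copy number from coverage can be sketched as a per-window log2 ratio of tumour to germline read counts. This is a simplified illustration with made-up counts: it normalises only by total read count and adds an assumed pseudocount of 0.5, and it does not include the GC content, mappability or allele frequency steps that tools such as Titan apply.

```python
from math import log2


def log2_ratios(tumour_counts: list, normal_counts: list,
                pseudocount: float = 0.5) -> list:
    """Per-window log2(tumour / normal) after scaling by total read count."""
    tumour_total = sum(tumour_counts)
    normal_total = sum(normal_counts)
    ratios = []
    for tumour, normal in zip(tumour_counts, normal_counts):
        tumour_scaled = (tumour + pseudocount) / tumour_total
        normal_scaled = (normal + pseudocount) / normal_total
        ratios.append(log2(tumour_scaled / normal_scaled))
    return ratios


if __name__ == "__main__":
    # Made-up read counts in four equal-sized windows; the third window
    # has half the tumour coverage, as expected over a one-copy deletion.
    tumour = [100, 100, 50, 100]
    normal = [100, 100, 100, 100]
    for window, ratio in enumerate(log2_ratios(tumour, normal), start=1):
        print(f"window {window}: log2 ratio {ratio:+.2f}")
```

With so few windows the total-count scaling shifts every ratio slightly upwards; across a whole genome that effect is small unless a large share of the tumour genome is gained or lost.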
Clusters of soft clipping indicate rearrangement break points
Alignment software that performs soft clipping can reveal exact positions of the break points
Split reads and assembled contigs reveal microhomology
Further realignment of the clipped sequences reveals split reads
Reads with soft clipping and unmapped reads can be assembled into contigs that span breakpoints
Patterns of microhomology can be obtained from these data
CREST Wang et al 2012
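Finding clusters of soft clipping starts with the POS and CIGAR fields of each read. The sketch below is a minimal illustration, not the method used by CREST or qSV: it reports the 1-based reference position of the aligned base next to each soft clip and keeps positions supported by an assumed minimum of 3 reads. The example reads are made up.

```python
import re
from collections import Counter

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # operations that advance along the reference


def clip_breakpoints(pos: int, cigar: str) -> list:
    """Reference positions of the aligned bases adjacent to soft clips.

    pos is the 1-based leftmost aligned position from the SAM POS field.
    """
    operations = [
        (int(length), op)
        for length, op in CIGAR_OPERATION.findall(cigar)
        if op != "H"  # hard clips carry no sequence, so ignore them here
    ]
    breakpoints = []
    if operations and operations[0][1] == "S":
        breakpoints.append(pos)  # clipped bases lie left of the first aligned base
    if len(operations) > 1 and operations[-1][1] == "S":
        reference_length = sum(n for n, op in operations if op in REFERENCE_CONSUMING)
        breakpoints.append(pos + reference_length - 1)  # last aligned base
    return breakpoints


def clip_clusters(reads: list, min_support: int = 3) -> dict:
    """Count soft clip positions over (pos, cigar) pairs and keep supported ones."""
    counts = Counter(
        breakpoint
        for pos, cigar in reads
        for breakpoint in clip_breakpoints(pos, cigar)
    )
    return {position: n for position, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    # Made-up reads: three are clipped after reference position 1059.
    reads = [(1000, "60M40S"), (1010, "50M50S"), (1030, "30M70S"),
             (1060, "40S60M"), (900, "100M")]
    print(clip_clusters(reads))
```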
qSV : Detecting Somatic Structural Variants
qSV detects 3 types of supporting evidence
Resolves all lines of evidence to identify breakpoints to base pair resolution
Felicity Newell
http://sourceforge.net/projects/adamajava/
Associating structural variants with proximal genes
How do the breakpoints and rearrangements affect the underlying genes?
© QIMR Berghofer Medical Research Institute
SVs may promote tumour development
Oncogenes can be amplified by rearrangements resulting in gain of function
Amplification of HER2 (ERBB2) in pancreatic adenocarcinoma (Chou et al 2013 Genome Med)
[Figure: Gene 1 and Gene 2 before and after duplication of Gene 2]
Cancer molecular subtyping with cohort studies
Take a group of samples with the same disease and look for the same gene/pathway being altered (molecular subtyping)
[Figure: single patient; disease specific cohort (x 100's) divided into groups of 80%, 15%, 4% and 1%]
Molecular subtyping can be performed using, and by integrating, different data sources
Mutations
Gene expression
Methylation
Copy number
Structural rearrangements
450 pancreatic cancers (Bailey et al Nature 2016)
100 WGS pancreatic cancers (Waddell et al Nature 2015)
Mutational signatures
Pan-cancer molecular subtyping
The Cancer Genome Atlas Pan-Cancer analysis project
Nature Genetics 45, 1113–1120 (2013)
Leukaemia
Lung adenocarcinoma
Lung squamous
Kidney
Bladder
Endometrial
Glioblastoma
Head and neck
Breast
Ovarian
Colon
Rectum
Pan-Cancer analysis to identify common molecular features of tumours
International consortia make pan-cancer studies possible, extending the cohort approach
Pan-cancer studies are indicating that existing treatment options can be repurposed for other cancer types
[Figure: single patient; disease specific cohort (x 100's); multiple cohorts (x 1000's); personalised treatment selection]
Acknowledgements:
Genome informatics:
John Pearson
Conrad Leonard
Oliver Holmes
Qinying Xu
Scott Wood
Sean Grimmond
National Health and Medical Research Council
Australian Government
Anna deFazio
Catherine Kennedy
Yoke-Eng Chiew
Jillian Hung
Clinicians and patients
Medical genomics:
Nicola Waddell
Katia Nones
Felicity Newell
Stephen Kazakoff
Martha Zakrzewski
Venkateswar Addala
Andrew Biankin
David Chang
Peter Bailey
Jianmin Wu
Jeremy Humphris
Mark Pinese
Angela Chou
Mark Cowley
APGI collaborators http://www.pancreaticcancer.net.au/apgi/collaborators
Including: John Fawcett, O’Rourke, Andrew Barbour,
Henry Tang, Kelly Slater, Nik Zeps
Amber Johns
Anthony Gill
Scott Mead
Skye Simpson
Marc Jones
David Bowtell
Dariush Etemadmoghadam
Elizabeth Christie
Dale Garsed
Joshy George
Sian Fereday
Laura Galletta
Kathryn Alsop
Nadia Traficante
Thank you
Email:
ann-marie.patch@qimrberghofer.edu.au
www.qimrberghofer.edu.au