Stanford Comprehensive Cancer Center, Stanford Genome Technology Center
An Introduction to Next Generation Sequencing
Hanlee Ji, M.D., Stanford University
Overview
• Principles of next generation DNA sequencing
• Analysis of genetic variation and research applications
Advances in DNA sequencing technology
M. Stratton et al. Nature 458 (2009)
Applications
• Identifying genetic variants
– Whole genome
– Exome
– Subsets
• Transcriptomes (e.g. RNASeq)
• ChIP-seq
• Epigenomes (methylation)
• Many others!
Sequencing-by-synthesis
• Individual DNA molecules from a “sequencing library”.
• Sequencing via multiple cycles of nucleotide incorporation.
• Solid phase support
• High density reads using a photodetector (e.g. CCDs) or solid state system
• Images from each cycle provide the sequence data.
J. Shendure and H. Ji, Nat Biotech (2008)
Sequencing-by-ligation
Complete Genomics
• DNA nanoballs from circles
• Combinatorial probe anchor ligation
• 10 base reads adjacent to 8 anchor sites
• 31- to 35-base mate-paired reads
Drmanac et al. Science (2010)
Solid state detection of DNA synthesis
Rothberg et al. Nature (2011)
“Nanowell” solid-state detection
Single molecule sequencing
New technologies
• Single molecule detection
• Pacific Biosciences
– Sequencing by synthesis
– Single base incorporation
“nanowell” sequencing-by-synthesis
Nanopore sequencing
• DNA inserted in a nanopore in a lipid membrane
• Speed control provided by a phi29 DNA polymerase
• Translocation via an electrical field and the polymerase
• DNA sequence read via changes in the ionic current
Issues with next generation DNA sequencing
• Higher sequencing error rates
– <0.1 to 10% or greater depending on sequencing chemistry and configuration
• Systematic bias based on approach
• Short sequence reads (<250 bases)
• Massive data output
– Data storage and management
– Variant calling analysis
Aspects of next generation sequencing
• DNA sequencing library preparation.
• Processing of sequence reads
• Types of reads (e.g. “mate pairs”)
• Alignment
– Fold coverage
• Assembly
• Variant calling
Overview of the process in whole genome sequencing
D Koboldt et al. Briefings in Bioinformatics (2010)
Sequencing library preparation – 454 system
Sequencing process
Sequencing data generation and analysis
D Koboldt et al. Briefings in Bioinformatics (2010)
Quality metrics to improve variant calls
• Sequencing fold coverage based on alignment.
– Higher fold coverage required in cancer genomes
• Elimination of duplicate reads.
– Bottlenecks which propagate errors from DNA amplification.
• Using high quality base calls
– Quality scores 30 or higher
• Repeat sequences in genomes.
• Significance or confidence values for variants
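Two of the metrics above reduce to simple arithmetic. The sketch below assumes the standard Phred definition of a base quality score and the usual definition of average fold coverage (total aligned bases divided by genome size); the function names are illustrative, not taken from any particular tool.

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call is wrong, given its Phred quality score Q."""
    return 10 ** (-q / 10)


def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average fold coverage: total aligned bases divided by genome size."""
    return num_reads * read_length / genome_size


def is_high_quality(q: int, min_q: int = 30) -> bool:
    """Keep base calls with a quality score of 30 or higher, as on the slide."""
    return q >= min_q


# A quality score of 30 corresponds to roughly 1 error in 1,000 base calls.
print(phred_to_error_probability(30))
# 900 million aligned 100-base reads over a 3 billion base genome.
print(fold_coverage(900_000_000, 100, 3_000_000_000))
```

The read count and read length in the usage lines are made-up inputs chosen only to give a round coverage figure.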
DNA sequence data format and visualization
• Sequence alignment map (SAM)
• Viewing “pileups”
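As a rough illustration of the SAM format, each alignment line carries eleven mandatory tab-separated fields. The sketch below parses one such line with the standard library only; the example read, its coordinates and the helper names are hypothetical.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base qualities (ASCII encoded)


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_duplicate(rec: SamRecord) -> bool:
    """FLAG bit 0x400 marks a PCR or optical duplicate."""
    return bool(rec.flag & 0x400)


def is_unmapped(rec: SamRecord) -> bool:
    """FLAG bit 0x4 marks an unmapped read."""
    return bool(rec.flag & 0x4)


# Hypothetical alignment line, for illustration only.
example = "read1\t99\tchr18\t1000\t60\t10M\t=\t1200\t210\tGATCTTCCAA\tIIIIIIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, is_duplicate(rec))
```

Real pipelines would normally use an established SAM/BAM library rather than hand parsing; this only shows what a record contains.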
Genetic variation
• Point mutations
– Nonsynonymous versus synonymous
• Insertion / deletions (indels)
• Copy number variations (CNVs)
• Structural variants (SV)
– Intrachromosomal
• Large indels
• Duplications
• Inversions
– Interchromosomal
• Balanced translocations
• Imbalanced translocations
Single nucleotide variants from cancer genomes
S. P. Shah et al. Nature, 461 (2009)
Variant callers
• Genome Analysis Toolkit (GATK)
• VarScan
• SAMtools
• SNVMix
• Others…
Single nucleotide mutations
• Silent = synonymous
• Substitution = nonsynonymous
• Nonsense = premature stop
http://commons.wikimedia.org
Transitions versus transversion mutations
Ding et al. Nature (2010)
• Transitions
– A <-> G
– C <-> T
• Transversions
– A <-> T
– A <-> C
– G <-> T
– G <-> C
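The classification above follows from base chemistry: a transition keeps the base within the purines (A, G) or within the pyrimidines (C, T), while a transversion crosses between the two classes. A minimal sketch, with illustrative function names:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def ti_tv_ratio(substitutions) -> float:
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    kinds = [classify_substitution(r, a) for r, a in substitutions]
    ti = kinds.count("transition")
    tv = kinds.count("transversion")
    return ti / tv if tv else float("inf")


print(classify_substitution("A", "G"))  # purine to purine
print(classify_substitution("A", "T"))  # purine to pyrimidine
```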
Small insertion and deletions
Targeting strategies for resequencing genomic subsets
Array-based hybridization capture
In-solution capture (e.g. molecular inversion probes)
In-solution hybridization capture
Rapid targeted mutation analysis from cancer genomes
[Figure: (a) Preparation and processing: target-specific oligonucleotides and a single-adaptor library on the flow cell; hybridization, extension and denaturation; immobilized DNA. (b) Step 1: primer-probe preparation. Step 2: target capture. Step 3: cluster preparation, followed by sequence data. Labels: primer-probe, immobilized primers ‘C’ and ‘D’, sequencing primers 1 and 2.]
“Onconomic” diagnostic mutations analysis
• Rapid mutation analysis at the point of care
• Analysis of identified cancer drivers
• Determination of pathogenic mutations
• Example: nonsense mutation in SMAD4
Normal
Tumor
Visualizing sequence
SNP genotyping
1.5 Mb region on Chromosome 18
Whole genome sequencing
M. Stratton et al. Nature, 458 (2009)
A needle in a human genome haystack?
• A human genome has 23 pairs of chromosomes.
• 6 billion individual DNA base pairs per diploid genome.
• A single base pair error can be a disease mutation.
..GATC..ERROR..TTCCAA..
X
Exome sequencing
M. Clark et al. Nature Biotechnology (2012)
A cancer family pedigree
[Pedigree figure. Legend: colorectal cancer, colon polyps, no cancer; male, female. Labels: 42 y/o, 43 y/o, AP.]
Assessment of a cancer family – unaffected versus affected
[Figure labels: AP, Mother, Father]
Exome sequencing analysis for identifying inherited disease
[Figure: numbered columns (1 to 11, etc.) shown for AP, Mother and Father, highlighting AP’s unique family variants]
• Identify the variants unique to the affected members.
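One way to picture this filtering step is as set arithmetic on variant calls: keep what every affected member shares, then remove anything seen in an unaffected member. The sketch below assumes each variant is a (chromosome, position, ref, alt) tuple; the function name and all variant data are hypothetical.

```python
def variants_unique_to_affected(affected, unaffected):
    """Variants present in every affected member and in no unaffected member.

    Each argument is a list of per-person variant collections, where a
    variant is a (chromosome, position, ref, alt) tuple.
    """
    affected_sets = [set(v) for v in affected]
    if not affected_sets:
        return set()
    shared_by_affected = set.intersection(*affected_sets)
    seen_in_unaffected = set().union(*(set(v) for v in unaffected))
    return shared_by_affected - seen_in_unaffected


# Toy calls, for illustration only.
ap = {("chr18", 500, "C", "T"), ("chr1", 100, "A", "G"), ("chr2", 200, "G", "A")}
mother = {("chr1", 100, "A", "G")}
father = {("chr2", 200, "G", "A"), ("chr3", 300, "T", "C")}

print(variants_unique_to_affected([ap], [mother, father]))
```

Which family members count as affected or unaffected is an input here; the example simply treats AP as affected and both parents as unaffected.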
Interpretation of genetic variants
• Substitutions are translated bioinformatically
• SIFT: probability that a substitution is tolerated
– < 0.05 is deleterious
• PolyPhen: categorical definitions
– "benign", "possibly damaging" and "probably damaging"
• Protein structural mapping
IDH1 mapping of Arg132 cancer mutation
Parsons et al., Science (2008)
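The SIFT rule on this slide is a simple threshold, sketched below. Only the 0.05 cutoff and the three PolyPhen category names come from the slide; the function names are illustrative, and the code does not call either tool.

```python
POLYPHEN_CATEGORIES = ("benign", "possibly damaging", "probably damaging")


def sift_call(score: float, threshold: float = 0.05) -> str:
    """Label a substitution from its SIFT score (probability it is tolerated)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("a SIFT score is a probability between 0 and 1")
    return "deleterious" if score < threshold else "tolerated"


def is_valid_polyphen_category(label: str) -> bool:
    """Check a label against the categorical definitions listed on the slide."""
    return label.lower() in POLYPHEN_CATEGORIES


print(sift_call(0.01))
print(sift_call(0.40))
```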
Sequence assembly
• Assembling fragments of random sequence to form a set of larger contiguous sequences (contigs).
• Used to assemble de novo genomes of new organisms.
• Useful for reconstructing regions of high complexity such as SVs.
Zerbino DR, Birney E, Genome Research, 18 (2008)
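To make the idea of building contigs from overlapping fragments concrete, here is a toy greedy overlap-merge assembler. This is a deliberately simple stand-in: it is not the de Bruijn graph approach of the cited assembler, it ignores sequencing errors, reverse complements and reads contained inside other reads, and the reads are made up.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the longest overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if n > 0 and a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_overlap: int = 3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


print(greedy_assemble(["GATCTT", "CTTCCA", "CCAAGG"]))
```

Greedy merging can collapse repeats incorrectly, which is one reason production assemblers use graph-based methods instead.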
Metagenomic characterization of bacterial flora
Copy number from genome sequencing
• Genome shotgun sequencing comparison.
• Copy number variation derived directly from sequence reads.
• 15 Kb windows with sequence tag counting
Campbell et al., Nature Genetics, (2008)
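Window-based tag counting can be sketched as follows: bin read start positions into 15 Kb windows, then compare tumor and normal counts per window. This is a minimal sketch for a single chromosome; the depth normalization, the +1 pseudocount and the function names are assumptions made for the example, not the published method.

```python
import math
from collections import Counter


def window_counts(positions, window_size: int = 15_000) -> Counter:
    """Count read start positions falling in each fixed-size window."""
    return Counter(pos // window_size for pos in positions)


def log2_copy_ratio(tumor_positions, normal_positions, window_size: int = 15_000):
    """Per-window log2(tumor / normal) read-count ratio.

    Tumor counts are rescaled so both samples have the same total depth.
    """
    tumor = window_counts(tumor_positions, window_size)
    normal = window_counts(normal_positions, window_size)
    depth_scale = len(normal_positions) / len(tumor_positions)
    ratios = {}
    for window, n_count in normal.items():
        t_count = tumor.get(window, 0) * depth_scale
        # The +1 pseudocount avoids log2(0) in windows with no tumor reads.
        ratios[window] = math.log2((t_count + 1) / (n_count + 1))
    return ratios


# Toy data: window 0 looks like a loss in the tumor, window 1 like a gain.
tumor_reads = [100] * 10 + [16_000] * 20
normal_reads = [100] * 15 + [16_000] * 15
print(log2_copy_ratio(tumor_reads, normal_reads))
```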
Copy number variations (CNVs) from genomic sequencing
Genomic sequence analysis
Array CGH CNV analysis
Breast cancer – Chromosome 1
Structural variations in human genomes
[Figure: deletion, duplication, inversion, insertion and translocation, grouped as intrachromosomal or interchromosomal. Source: http://commons.wikimedia.org]
Structural variation
• The expected distance between mate pair sequences depends on the insert size of the genomic DNA library (a population of fragment sizes).
[Figure: mate pairs 300 nts apart across exon n and exon n+1. Normal: intact region. Tumor: deleted region.]
Genomic deletion analysis
• Breast cancer genome sequencing.
• Mate pair sequences used in indel analysis.
• Changes in the location of mapped reads that are not concordant with the sequencing library insert size.
Normal
Primary
Metastasis
Xenograft
Ding et al, Nature, (2010)
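The discordance idea can be sketched in a few lines: when mates map much farther apart on the reference than the library insert size allows, the sample may carry a deletion between them. The 300-base insert size echoes the earlier figure; the tolerance, function names and coordinates are assumptions for illustration.

```python
def is_discordant(span: int, expected_insert: int = 300, tolerance: int = 100) -> bool:
    """True when the mapped span differs from the expected insert size by more than the tolerance."""
    return abs(span - expected_insert) > tolerance


def candidate_deletions(pairs, expected_insert: int = 300, tolerance: int = 100):
    """Return (left, right, excess) for mate pairs that map too far apart.

    pairs holds (left_pos, right_pos) reference coordinates of the two mates.
    A span well above the insert size suggests sequence present in the
    reference but missing from the sample between the mates.
    """
    calls = []
    for left, right in pairs:
        excess = (right - left) - expected_insert
        if excess > tolerance:
            calls.append((left, right, excess))
    return calls


# Toy coordinates: only the second pair spans far more than 300 bases.
pairs = [(1000, 1300), (5000, 7300), (9000, 9280)]
print(candidate_deletions(pairs))
```

A real caller would require several independent pairs supporting the same breakpoint before reporting a deletion.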
Structural variants from small cell lung cancer genome
Campbell et al., Nature Genetics, (2008)
Duplication Inversion
Translocations in colorectal cancer genomes
• Balanced translocations between chr 8 and 20 p arms
• Structural changes can only be delineated based on sequencing
Bass et al., Nature Genetics, (2011)
Cancer transcriptome sequencing (RNASeq)
• Mate pair analysis from prostate cancer mRNA
• Identification of reads indicating gene fusions.
N Palanisamy et al, Nat Med 16 (2010)
Sequenced cancer genomes – nonsmall cell lung
Lee et al. Nature 465, (2010)
Tumor coverage: 60X
Normal coverage: 46X
Mutation rate per Mb: 17.7
Total identified tumor mutations: 83,000
Coding mutations: 540
Validated mutations: 302
Total identified indels: 54,921
Coding indels: 253
Total identified structural variants: 79
Validated structural variants: 43
Whole genome analysis of colorectal cancer
• Cancer Genome Atlas analysis of colon adenocarcinoma
• “Circos” plots of whole genome data
Gene expression and RNASeq
ChIP-Seq
Ultrasensitive mutation detection
• Robust detection of 1 mutant allele from 1,000 wildtype alleles in heterogeneous mixtures
• Application to viral infections
• Analysis of cancer point mutations
Flaherty et al., Nucleic Acids Research, 2012
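Detecting a 1-in-1,000 allele means separating true variant reads from sequencing errors at the same position. The sketch below uses a plain binomial tail probability as a simple illustration; it is not presented as the statistical model of the cited paper, and the depth, error rate and read counts are assumed values.

```python
import math


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Assumed scenario: 100,000x depth at one position, 0.1% per-base error rate.
depth, error_rate = 100_000, 0.001
# About 100 non-reference reads are expected from error alone, so 100 is unremarkable...
print(binomial_tail(100, depth, error_rate))
# ...while 200 (errors plus a 1-in-1,000 mutant allele) is very unlikely under error alone.
print(binomial_tail(200, depth, error_rate))
```

The example also shows why a low, well-characterized error rate matters: the variant signal is only detectable once it clearly exceeds the error background.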
Deep resequencing for rare variants
Smith et al., Nature 2009
• Derived from reassortment of swine and human flu in swine
• More than 214 countries in 2009
• More than 622,482 infections confirmed
• 18,449 deaths confirmed by WHO
Oseltamivir resistance mutation in influenza
Detection of the oseltamivir resistance mutation
• Neuraminidase bound to oseltamivir (Tamiflu)
• Mutations cluster around sialic acid/oseltamivir binding pocket
Collins et al. Nature 2008.
Phylogenetic tree of H1 influenza genomes
Conclusion
• Multiple approaches available for analysis of genomes
• Scale of sequence data requires extensive computational, bioinformatic and statistical data analysis
• Methods, technologies and analysis continue to improve and become simpler.
Genetics via large scale sequencing
• Mutations and other genomic DNA aberrations contribute to neoplastic development
• Specific genetic variants and other aberrations indicate clinical phenotype
• Utility as diagnostics
Cancer exome survey (Sanger sequencing)
• Each row represents a chromosome.
• Peaks represent driver mutations
TP53
Wood et al, Science, 318 (2007)
Tiers of cancer genome sequencing
Complete Human Genomes
Exomes & Transcriptomes
Genomic Subsets
Cancer diagnostic
Translational studies
Discovery
?
Whole cancer genome sequencing
• Pros
– Most comprehensive coverage of the genome.
– Least bias – most objective analysis
– Highest resolution at base pair level
– Identification of complex structural variants
– Experimentally straightforward…
• Cons
– Cost (rapidly dropping!)
– Rapid evolution of technologies
– Challenging data management and analysis
Sample and genetic complexity of cancer
• Sample variability
– Normal stroma contamination
– Mixtures of variable lineages
– Degradation of DNA
• Intratumoral genetic heterogeneity
– Clonal subpopulations carrying different mutations
• Background random mutations (e.g. passengers)
• Complex genomic structure
Sequencing cancer genomes – clinical samples
• Type of samples
– Cancer cell lines
– Xenografts
– Primary tumors
– Purified cancer cells
• Requirement for matched samples
– Normal diploid genome
False positive mutation rates in genome sequencing
• A low false positive mutation rate requires high accuracy (3 billion base genome).
– 1 error / 10,000 bases = 300,000 false mutations
– 1 error / 100,000 bases = 30,000 false mutations
• Ideal false positive rate
– 1 error / 1,000,000 bases
– ~50% of candidate mutations are correct!
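The figures above follow from multiplying the per-base error rate by the genome size. The sketch below reproduces them for a 3 billion base genome, which is what the slide's arithmetic implies. The count of 3,000 true mutations is an assumption chosen only to show how a figure near 50% can arise; it is not a number from the slides.

```python
GENOME_SIZE = 3_000_000_000  # bases, implied by the slide's arithmetic


def expected_false_mutations(error_rate: float, genome_size: int = GENOME_SIZE) -> float:
    """Expected number of false mutation calls across the genome."""
    return error_rate * genome_size


def fraction_correct(true_mutations: int, error_rate: float,
                     genome_size: int = GENOME_SIZE) -> float:
    """Share of candidate mutations that are real, given true and false counts."""
    false_calls = expected_false_mutations(error_rate, genome_size)
    return true_mutations / (true_mutations + false_calls)


for rate in (1e-4, 1e-5, 1e-6):
    print(rate, round(expected_false_mutations(rate)))

# Assumed 3,000 true mutations against a 1-in-1,000,000 error rate.
print(fraction_correct(3_000, 1e-6))
```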