Stanford Comprehensive Cancer Center Stanford Genome Technology Center An Introduction to Next Generation Sequencing Hanlee Ji, M.D. Stanford University



Overview

• Principles of next generation DNA sequencing

• Analysis of genetic variation and research applications


Advances in DNA sequencing technology

M. Stratton et al. Nature 458 (2009)


Applications

• Identifying genetic variants

– Whole genome

– Exome

– Subsets

• Transcriptomes (e.g. RNASeq)

• ChIP-Seq

• Epigenomes (methylation)

• Many others!


Sequencing-by-synthesis

• Individual DNA molecules from a “sequencing library”.

• Sequencing via multiple cycles of nucleotide incorporation.

• Solid phase support

• High density reads using a photodetector (e.g. CCD) or solid state system

• Images captured at each cycle provide the sequence data.

J. Shendure and H. Ji. Nat Biotech (2008)


Sequencing-by-ligation

Complete Genomics

• DNA nanoballs from circles

• Combinatorial probe anchor ligation

• 10 base reads adjacent to 8 anchor sites

• 31- to 35-base mate-paired reads

Drmanac et al. Science (2010)


Solid state detection of DNA synthesis

Rothberg et al. Nature (2011)

“Nanowell” solid-state detection


Single molecule sequencing

New technologies

• Single molecule detection

• Pacific Biosciences

– Sequencing by synthesis

– Single base incorporation

“nanowell” sequencing-by-synthesis


Nanopore sequencing

• DNA inserted in a nanopore in lipid membrane

• Speed control provided by a phi29 DNA polymerase

• Translocation via an electrical field and the polymerase

• DNA sequence read via changes in the ionic current


Issues with next generation DNA sequencing

• Higher sequencing error rates

– <0.1 to 10% or greater depending on sequencing chemistry and configuration

• Systematic bias based on approach

• Short sequence reads (<250 bases)

• Massive data output

– Data storage management

– Variant calling analysis


Aspects of next generation sequencing

• DNA sequencing library preparation.

• Processing of sequence reads

• Types of reads (e.g. “mate pairs”)

• Alignment

– Fold coverage

• Assembly

• Variant calling


Overview of the process in whole genome sequencing

D Koboldt et al. Briefings in Bioinformatics (2010)


Sequencing library preparation – 454 system


Sequencing process


Sequencing data generation and analysis

D Koboldt et al. Briefings in Bioinformatics (2010)


Quality metrics to improve variant calls

• Sequencing fold coverage based on alignment.

– Higher fold coverage required in cancer genomes

• Elimination of duplicate reads.

– Bottlenecks which propagate errors from DNA amplification.

• Using high quality base calls

– Quality scores 30 or higher

• Repeat sequences in genomes.

• Significance or confidence values for variants
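
The quality score and fold coverage bullets above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming Phred-scaled quality scores (Q = -10 log10 of the error probability) and Phred+33 ASCII encoding of quality strings; the function names and the example numbers are illustrative and not from the slides.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into the probability the base call is wrong."""
    return 10 ** (-q / 10)


def high_quality_bases(seq: str, qual: str, min_q: int = 30, offset: int = 33) -> str:
    """Mask bases whose quality is below min_q with 'N'.

    qual is an ASCII-encoded quality string; the Phred+33 offset is an assumption.
    """
    return "".join(
        base if ord(ch) - offset >= min_q else "N"
        for base, ch in zip(seq, qual)
    )


def fold_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average fold coverage = total aligned bases / genome size."""
    return n_reads * read_length / genome_size


if __name__ == "__main__":
    print(phred_to_error_prob(30))            # Q30 is about 1 error in 1,000 base calls
    print(high_quality_bases("ACGT", "II5I"))  # 'I' is Q40, '5' is Q20
    print(fold_coverage(900_000_000, 100, 3_000_000_000))
```

Under these assumptions, a cutoff of Q30 keeps only base calls with an estimated error probability of 0.1% or lower.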


DNA sequence data format and visualization

• Sequence alignment map (SAM)

• Viewing “pileups”
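
To show what a SAM alignment record looks like in practice, here is a minimal sketch that splits one tab-delimited line into the eleven mandatory SAM fields and checks two FLAG bits (0x4 unmapped, 0x400 duplicate). The example read, its coordinates and the helper names are hypothetical; real pipelines would normally use an established SAM/BAM library rather than hand parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template (insert) length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of a SAM alignment line (optional tags are ignored)."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_unmapped(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x4)


def is_duplicate(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x400)


# Hypothetical record, for illustration only.
example = "read1\t99\tchr18\t48573500\t60\t10M\t=\t48573700\t210\tACGTACGTAC\tIIIIIIIIII"

if __name__ == "__main__":
    rec = parse_sam_line(example)
    print(rec.rname, rec.pos, rec.cigar, is_duplicate(rec))
```

Filtering on the duplicate bit is one way to apply the "elimination of duplicate reads" metric once duplicates have been marked by an upstream tool.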


Genetic variation

• Point mutations

– Nonsynonymous versus synonymous

• Insertion / deletions (indels)

• Copy number variations (CNVs)

• Structural variants (SV)

– Intrachromosomal

• Large indels

• Duplications

• Inversions

– Interchromosomal

• Balanced translocations

• Imbalanced translocations


Single nucleotide variants from cancer genomes

S. P. Shah et al. Nature, 461 (2009)


Variant callers

• Genome Analysis Toolkit

• VarScan

• SAMtools

• SNVMix

• Others…


Single nucleotide mutations

• Silent = synonymous

• Missense (amino acid substitution) = nonsynonymous

• Nonsense = premature stop

http://commons.wikimedia.org
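
The three categories above follow directly from the genetic code. As a minimal sketch, assuming the standard codon table and in-frame single-codon changes, a substitution can be classified like this (the function name and example codons are illustrative):

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change as synonymous, nonsense or nonsynonymous."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"       # silent: same amino acid
    if alt_aa == "*":
        return "nonsense"         # premature stop
    return "nonsynonymous"        # amino acid substitution


if __name__ == "__main__":
    print(classify_codon_change("GAA", "GAG"))  # Glu -> Glu
    print(classify_codon_change("CGA", "TGA"))  # Arg -> stop
    print(classify_codon_change("CGT", "CAT"))  # Arg -> His
```

This sketch does not handle stop-loss changes, frameshifts or splice-site variants separately; a stop codon changed to an amino acid is simply reported as nonsynonymous.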


Transitions versus transversion mutations

Ding et al. Nature (2010)

• Transitions

– A <-> G

– C <-> T

• Transversions

– A <-> T

– A <-> C

– G <-> T

– G <-> C
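
The rule behind this slide is that a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), and everything else is a transversion. A minimal sketch of that rule, with an illustrative transition/transversion ratio helper (names are not from the slides):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def ti_tv_ratio(substitutions) -> float:
    """Ratio of transitions to transversions in an iterable of (ref, alt) pairs."""
    calls = [classify_substitution(r, a) for r, a in substitutions]
    n_tv = calls.count("transversion")
    if n_tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return calls.count("transition") / n_tv


if __name__ == "__main__":
    snvs = [("A", "G"), ("C", "T"), ("A", "T"), ("G", "C")]
    print([classify_substitution(r, a) for r, a in snvs])
    print(ti_tv_ratio(snvs))
```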


Small insertion and deletions


Targeting strategies for resequencing genomic subsets

• Array-based hybridization capture

• In-solution capture (e.g. molecular inversion probes)

• In-solution hybridization capture


Rapid targeted mutation analysis from cancer genomes

[Figure, panel a (preparation, processing): target-specific oligonucleotide; single-adaptor library; flow cell; hybridization, extension and denaturation; immobilized DNA.]

[Figure, panel b: Step 1, primer-probe preparation; Step 2, target capture; Step 3, cluster preparation; sequence; data. Labels: primer-probe, immobilized primer ‘C’, immobilized primer ‘D’, sequencing primer 1, sequencing primer 2.]


“Onconomic” diagnostic mutations analysis

• Rapid mutation analysis at the point of care

• Analysis of identified cancer drivers

• Determination of pathogenic mutations

• Example: nonsense mutation in SMAD4

Normal

Tumor


Visualizing sequence

SNP genotyping

1.5 Mb region on Chromosome 18


Whole genome sequencing

M. Stratton et al. Nature, 458 (2009)


A needle in a human genome haystack?

• A human genome has 23 pairs of chromosomes.

• 6 billion individual DNA basepairs per diploid genome.

• A single basepair error can be a disease mutation.

..GATC..ERROR..TTCCAA..

X


Exome sequencing

M. Clark et al. Nature Biotechnology (2012)


A cancer family pedigree

[Figure: pedigree. Legend: colorectal cancer, colon polyps, no cancer; male, female. Two members are labeled 42 y/o and 43 y/o; one member is labeled AP.]


Assessment of a cancer family – unaffected versus affected

[Figure: AP shown with mother and father.]


Exome sequencing analysis for identifying inherited disease

[Figure: variant lists (1, 2, 3 ... 11, etc.) for AP, mother and father, highlighting AP’s unique family variants.]

• Identify the variants unique to the affected members.
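
The filtering step in the bullet above amounts to set arithmetic on variant calls. Here is a minimal sketch, assuming each variant is represented as a (chromosome, position, ref, alt) tuple; the variant data and the function name are hypothetical, and real analyses would also account for inheritance model, genotype quality and coverage.

```python
def variants_unique_to_affected(affected, unaffected):
    """Variants shared by all affected members and absent from every unaffected member.

    affected, unaffected: lists of per-person variant collections.
    """
    if not affected:
        raise ValueError("at least one affected individual is required")
    shared = set.intersection(*(set(v) for v in affected))
    seen_in_unaffected = set().union(*(set(v) for v in unaffected))
    return shared - seen_in_unaffected


# Hypothetical variant calls, for illustration only.
patient = {("chr5", 1000, "C", "T"), ("chr18", 2000, "G", "A"), ("chr2", 3000, "A", "G")}
affected_relative = {("chr5", 1000, "C", "T"), ("chr18", 2000, "G", "A")}
unaffected_relative = {("chr18", 2000, "G", "A"), ("chr2", 3000, "A", "G")}

if __name__ == "__main__":
    print(variants_unique_to_affected([patient, affected_relative], [unaffected_relative]))
```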


Interpretation of genetic variants

• Substitutions are translated and assessed bioinformatically

• SIFT – probability that a substitution is tolerated

– < 0.05 is deleterious.

• PolyPhen – categorical definitions

– "benign", "possibly damaging" and "probably damaging"

• Protein structural mapping

IDH1 mapping of Arg132 cancer mutation

Parsons et al., Science (2008)
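
The SIFT threshold and PolyPhen categories listed above can be turned into a simple triage rule. This is a minimal sketch under the stated cutoffs only; the function names are illustrative, the scores would come from running the actual tools, and combining the two predictors with a logical "or" is an assumption rather than a recommendation from the slides.

```python
DAMAGING_POLYPHEN = {"possibly damaging", "probably damaging"}


def sift_call(score: float) -> str:
    """Apply the SIFT cutoff: a tolerance probability below 0.05 is called deleterious."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT score must be between 0 and 1")
    return "deleterious" if score < 0.05 else "tolerated"


def is_candidate(sift_score: float, polyphen_class: str) -> bool:
    """Flag a substitution if either predictor suggests a damaging effect (assumed rule)."""
    return (
        sift_call(sift_score) == "deleterious"
        or polyphen_class.lower() in DAMAGING_POLYPHEN
    )


if __name__ == "__main__":
    print(sift_call(0.01), sift_call(0.40))
    print(is_candidate(0.40, "probably damaging"))
    print(is_candidate(0.40, "benign"))
```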


Sequence assembly

• Assembling fragments of random sequence to form a set of larger contiguous sequences (contigs).

• Used to assemble de novo genomes of new organisms.

• Useful for reconstructing regions of high complexity such as SVs.

Zerbino DR, Birney E, Genome Research, 18 (2008)
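
To illustrate the idea of merging fragments into contigs, here is a toy greedy overlap assembler. It is a teaching sketch only and is not the algorithm of the cited paper or of any production assembler: it assumes error-free reads from one strand, ignores reads contained inside other reads, and uses a hypothetical minimum overlap of 3 bases.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Greedy merging can misassemble repeats, which is one reason repeat sequences are listed among the issues for short-read data.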


Metagenomic characterization of bacterial flora


Copy number from genome sequencing

• Genome shotgun sequencing comparison.

• Copy number variation derived directly from sequence reads.

• 15 Kb windows with sequence tag counting

Campbell et al., Nature Genetics, (2008)
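
Window-based tag counting can be sketched in a few lines. The example below is a simplified illustration, not the method of the cited study: it assumes read start positions on a single chromosome, fixed 15 kb windows, normalization by total read count, and a pseudocount (a hypothetical choice) to avoid division by zero.

```python
import math
from collections import Counter


def window_counts(positions, window: int = 15_000) -> Counter:
    """Count reads per fixed-size window, keyed by window index."""
    return Counter(pos // window for pos in positions)


def log2_ratios(tumor_positions, normal_positions, window: int = 15_000,
                pseudocount: float = 1.0) -> dict:
    """log2(tumor / normal) read density per window, keyed by window start coordinate."""
    if not tumor_positions or not normal_positions:
        raise ValueError("both samples need at least one read")
    tumor = window_counts(tumor_positions, window)
    normal = window_counts(normal_positions, window)
    t_total, n_total = len(tumor_positions), len(normal_positions)
    ratios = {}
    for w in sorted(set(tumor) | set(normal)):
        t_density = (tumor[w] + pseudocount) / t_total
        n_density = (normal[w] + pseudocount) / n_total
        ratios[w * window] = math.log2(t_density / n_density)
    return ratios


if __name__ == "__main__":
    # Hypothetical read positions: the tumor has fewer reads in the first window.
    normal = list(range(0, 20)) + [15_000 + i for i in range(20)]
    tumor = list(range(0, 10)) + [15_000 + i for i in range(30)]
    print(log2_ratios(tumor, normal, pseudocount=0.0))
```

A log2 ratio near -1 in a window is consistent with loss of one of two copies in a pure tumor sample; stromal contamination would pull the value toward 0.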


Copy number variations (CNVs) from genomic sequencing

Genomic sequence analysis

Array CGH CNV analysis

Breast cancer – Chromosome 1


Structural variations in human genomes

[Figure: deletion, duplication, inversion, insertion and translocation, grouped as intrachromosomal or interchromosomal. Image source: http://commons.wikimedia.org]


Structural variation

• Mapped distance between mate pair sequences depends on the insert size (population) of the genomic DNA library.

[Figure: normal versus tumor mate pairs (300 nts) across exon n and exon n+1, marking the deleted region and the intact region.]


Genomic deletion analysis

• Breast cancer genome sequencing.

• Mate pair sequences used in indel analysis.

• Changes in the location of mapped reads that are not concordant with the sequencing library insert size.

[Figure panels: Normal, Primary, Metastasis, Xenograft]

Ding et al, Nature, (2010)
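
The insert-size concordance test described above can be sketched as a simple threshold rule. This is an illustrative sketch, not the method of the cited study: it assumes the library insert size mean and standard deviation are already known, and the three-standard-deviation cutoff and function name are hypothetical choices.

```python
def classify_pair(observed_span: int, library_mean: float, library_sd: float,
                  n_sd: float = 3.0) -> str:
    """Compare the mapped distance between mates with the expected library insert size.

    Mates that map farther apart on the reference than the library allows suggest
    that sequence present in the reference is missing from the sample (a deletion);
    mates that map closer together suggest extra sequence in the sample (an insertion).
    """
    upper = library_mean + n_sd * library_sd
    lower = library_mean - n_sd * library_sd
    if observed_span > upper:
        return "possible deletion"
    if observed_span < lower:
        return "possible insertion"
    return "concordant"


if __name__ == "__main__":
    # Hypothetical library: 300 nt inserts with a standard deviation of 30 nt.
    for span in (310, 2300, 100):
        print(span, classify_pair(span, library_mean=300, library_sd=30))
```

In practice a call would need support from several independent discordant pairs at the same location, since single outliers are common.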


Structural variants from small cell lung cancer genome

Campbell et al., Nature Genetics, (2008)

Duplication Inversion


Translocations in colorectal cancer genomes

• Balanced translocations between chr 8 and 20 p arms

• Structural changes can only be delineated based on sequencing

Bass et al., Nature Genetics (2011)


Cancer transcriptome sequencing (RNASeq)

• Mate pair analysis from prostate cancer mRNA

• Identification of reads indicating gene fusions.

N Palanisamy et al, Nat Med 16 (2010)


Sequenced cancer genomes – nonsmall cell lung

Lee et al. Nature 465, (2010)

Tumor coverage                          60X
Normal coverage                         46X
Mutation rate per Mb                    17.7
Total identified tumor mutations        83,000
Coding mutations                        540
Validated mutations                     302
Total identified indels                 54,921
Coding indels                           253
Total identified structural variants    79
Validated structural variants           43


Whole genome analysis of colorectal cancer

• Cancer Genome Atlas analysis of colon adenocarcinoma

• “Circos” plots of whole genome data


Gene expression and RNASeq


ChIP-Seq


Ultrasensitive mutation detection

• Robust detection of 1 mutant allele from 1,000 wildtype alleles in heterogeneous mixtures

• Application to viral infections

• Analysis of cancer point mutations

Flaherty et al., Nucleic Acids Research, 2012
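
A rough sense of the sequencing depth needed to see a 1-in-1,000 allele comes from simple binomial sampling. The sketch below is only that, and is not the statistical model of the cited paper: it assumes each read independently samples the mutant allele with a fixed probability and it ignores sequencing error, which at this allele fraction is the real difficulty.

```python
from math import comb


def prob_at_least(k: int, depth: int, fraction: float) -> float:
    """P(at least k mutant reads) at a given depth, under a binomial sampling assumption."""
    p_fewer = sum(
        comb(depth, i) * fraction ** i * (1 - fraction) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Chance of seeing 5 or more mutant reads for a 0.1% allele at several depths.
    for depth in (1_000, 5_000, 10_000, 20_000):
        print(depth, round(prob_at_least(5, depth, 0.001), 4))
```

Because a 0.1% allele fraction is comparable to the per-base error rate of a Q30 base call, depth alone is not sufficient; the expected error background has to be modeled as well.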


Deep resequencing for rare variants

Smith et al., Nature 2009

• Derived from reassortment of swine and human flu in swine

• More than 214 countries in 2009

• More than 622,482 infections confirmed

• 18,449 deaths confirmed by WHO


Oseltamivir resistance mutation in influenza


• Neuraminidase bound to oseltamivir (Tamiflu)

• Mutations cluster around sialic acid/oseltamivir binding pocket

Collins et al. Nature 2008.

Detection of the oseltamivir resistance mutation


Phylogenetic tree of H1 influenza genomes


Conclusion

• Multiple approaches available for analysis of genomes

• Scale of sequence data requires extensive computational, bioinformatic and statistical data analysis

• Methods, technologies and analysis continue to improve and become simpler.



Genetics via large scale sequencing

• Mutations and other genomic DNA aberrations contribute to neoplastic development

• Specific genetic variants and other aberrations indicate clinical phenotype

• Utility as diagnostics


Cancer exome survey (Sanger sequencing)

• Each row represents a chromosome.

• Peaks represent driver mutations

TP53

Wood et al, Science, 318 (2007)


Tiers of cancer genome sequencing

[Figure: tiers of sequencing (complete human genomes; exomes and transcriptomes; genomic subsets) shown against uses (cancer diagnostic; translational studies; discovery), with one position marked "?".]


Whole cancer genome sequencing

• Pros

– Most comprehensive coverage of the genome.

– Least bias – most objective analysis

– Highest resolution at base pair level

– Identification of complex structural variants

– Experimentally straightforward…

• Cons

– Cost (rapidly dropping!)

– Rapid evolution of technologies

– Challenging data management and analysis


Sample and genetic complexity of cancer

• Sample variability

– Normal stroma contamination

– Mixtures of variable lineages

– Degradation of DNA

• Intratumoral genetic heterogeneity

– Clonal subpopulations carrying different mutations

• Background random mutations (e.g. passengers)

• Complex genomic structure


Sequencing cancer genomes – clinical samples

• Type of samples

– Cancer cell lines

– Xenografts

– Primary tumors

– Purified cancer cells

• Requirement for matched samples

– Normal diploid genome


False positive mutation rates in genome sequencing

• A low mutation false positive rate requires high base-calling accuracy (figures below assume a ~3 billion base genome).

– 1 error / 10,000 bases = 300,000 false mutations

– 1 error / 100,000 bases = 30,000 false mutations

• Ideal false positive rate

– 1 base / 1,000,000 error

– ~50% of candidate mutations are correct!
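
The numbers on this slide follow from multiplying the per-base error rate by the genome size. A minimal sketch of that arithmetic, assuming a genome of about 3 billion bases (which is what the slide's figures imply) and, for the last line, a hypothetical count of 3,000 true mutations chosen to reproduce the ~50% figure:

```python
GENOME_SIZE = 3_000_000_000  # assumed: ~3 billion bases


def expected_false_mutations(bases_per_error: int, genome_size: int = GENOME_SIZE) -> float:
    """Expected false mutation calls if one base in bases_per_error is miscalled."""
    return genome_size / bases_per_error


def fraction_correct(true_mutations: int, bases_per_error: int,
                     genome_size: int = GENOME_SIZE) -> float:
    """Fraction of candidate mutations that are real, given a count of true mutations."""
    false_calls = expected_false_mutations(bases_per_error, genome_size)
    return true_mutations / (true_mutations + false_calls)


if __name__ == "__main__":
    for bases_per_error in (10_000, 100_000, 1_000_000):
        print(bases_per_error, expected_false_mutations(bases_per_error))
    # Hypothetical: 3,000 true somatic mutations at an error rate of 1 in 1,000,000.
    print(fraction_correct(3_000, 1_000_000))
```

Even at one error per million bases, roughly 3,000 false calls remain, which is why candidate mutations still need independent validation.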
