Next-generation sequencing: informatics & software aspects

Next-generation sequencing:informatics & software aspects

Gabor T. MarthBoston College Biology Department

Harvard NanocourseOctober 7, 2009

New sequencing technologies…

… offer vast throughput … & many applications

read length

10 bp 1,000 bp100 bp

100 Mb

Illumina/Solexa, AB/SOLiD sequencers

ABI capillary sequencer

Roche/454 pyrosequencer(100Mb-1Gb in 200-450bp reads)

(10-50Gb in 25-100 bp reads)

100 Gb

Genome resequencing for variation discovery

short INDELs

structural variations

• the most immediate application area

Genome resequencing for mutational profiling

Organismal reference sequence

• likely to change “classical genetics” and mutational analysis

De novo genome sequencing

Lander et al. Nature 2001

• difficult problem with short reads

• promising, especially as reads get longer

Identification of protein-bound DNA

Chromatin structure (CHIP-SEQ)(Mikkelsen et al. Nature 2007)

Transcription binding sites. (Robertson et al. Nature Methods, 2007)

DNA methylation. (Meissner et al. Nature 2008)

• natural applications for next-gen. sequencers

Transcriptome sequencing: transcript discovery

Mortazavi et al. Nature Methods 2008

Ruby et al. Cell, 2006

• high-throughput, but short reads pose challenges

Transcriptome sequencing: expression profiling

Jones-Rhoads et al. PLoS Genetics, 2007

Cloonan et al. Nature Methods, 2008

• high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays

… & enable personal genome sequencing

The re-sequencing informatics pipelineREF

(ii) read mappingIND

(i) base calling

IND(iii) SNP and short INDEL calling

(v) data viewing, hypothesis generation

(iv) SV calling GigaBayesGigaBayes

The variation discovery “toolbox”

• base callers• read mappers

• SNP callers

• SV callers

• assembly viewers

GigaBayesGigaBayes

Base error characteristics vary

Inser-tions

Dele-tions

Substitutions95.34%

Illumina

Read lengths vary

read length [bp]0 100 200 300

~200-450 (variable)

25-100(fixed)

25-50 (fixed)

25-60 (variable)

Sequence traces are machine-specific

Base calling is increasingly left to machine manufacturers

Representational biases

Fragment duplication

Read mapping is like a jigsaw

… and they give you the picture on the box

2. Read mapping…you get the pieces…

Unique pieces are easier to place than others…

Multiply-mapping reads

• Reads from repeats cannot be uniquely mapped back to their true region of origin

• “Traditional” repeat masking does not capture repeats at the scale of the read length

Dealing with multiple mapping

Paired-end (PE) reads

fragment length: 100 – 600bp

Korbel et al. Science 2007

fragment length: 1 – 10kb

PE reads are now the standard for whole-genome short-read sequencing

Gapped alignments (for INDELs)

The MOSAIK read mapper

• gapped mapper• option to report multiple map locations• aligns 454, Illumina, SOLiD, Helicos reads• works with standard file formats (SRF, FASTQ, SAM/BAM)

Michael Strömberg

Alignment post-processing

0 5 10 15 20 25 30 35 40 45 500

Called base quality

• quality value re-calibration

• duplicate fragment removal

Data storage requirements

Alignment visualization

• too much data – indexed browsing• too much detail – color coding, show/hide

SNP calling: old problem, new data

sequencing error polymorphism

Pr | Pr | Pr , , ,

Pr , , , |i

nk ki i i n

nk k l l l li i

B T T G G G G

G G G B

Allele calling in next-gen data

New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection

SNP calling in multi-sample read sets

P(G1=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(G1=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(G1=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(Gi=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(Gi=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(Gi=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(Gn=aa|B1=aacc; Bi=aaaac; Bn= cccc)P(Gn=cc|B1=aacc; Bi=aaaac; Bn= cccc)P(Gn=ac|B1=aacc; Bi=aaaac; Bn= cccc)

P(SNP)

“genotype probabilities”

P(B1=aacc|G1=aa)P(B1=aacc|G1=cc)P(B1=aacc|G1=ac)

P(Bi=aaaac|Gi=aa)P(Bi=aaaac|Gi=cc)P(Bi=aaaac|Gi=ac)

P(Bn=cccc|Gn=aa)P(Bn=cccc|Gn=cc)P(Bn=cccc|Gn=ac)

“genotype likelihoods”

r(G1,.

.,Gi,..

-----a----------a----------c----------c-----

-----a----------a----------a----------a----------c-----

-----c----------c----------c----------c-----

Trio sequencing

2 22 2

11 12 22

1 111: 1 12 2 11: 111: 1

1 111 12 : 2 1 12 : 2 1 1 12 : 12 2

22 : 22 : 11 122 : 12 2

1 1 111: 1 1 11:2 2 4

Pr | , 1 112 12 : 2 1 12 2

1 122 : 12 2

C M FF

G G GG

2 22 2

1 1 1 11 1 11: 12 4 2 2

1 1 1 1 112 : 2 1 1 2 1 12 : 1 2 14 2 4 2 2

1 1 1 1 122 : 1 1 22 : 1 14 2 4 2 2

1 111: 12 211: 1

1 122 12 : 1 12 : 12

22 : 1FG

11:2 1 12 : 2 1

222 : 11 122 : 1 1

• the child inherits one chromosome from each parent• there is a small probability for a de novo (germ-line or somatic) mutation in the child

Alignment visualization

• too much data – indexed browsing• too much detail – color coding, show/hide

Standard data formats

SRF/FASTQ

SAM/BAM

GLF/VCF

Human genome polymorphism projects

common SNPs

Human genome polymorphism discovery

The 1000 Genomes Project

Pilot 1

Pilot 2

Pilot 3

1000G Pilot 3 – exon sequencing

• Targets:1K genes / 10K targets

• Capture: Solid / liquid phase

• Sequencing:454 / IlluminaSE / PE

• Data producers:BaylorBroadSangerWash. U.

• Informatics methods:Multiple read mapping &SNP calling programs

Coverage varies

On/off target capture

ref allele*:45%

non-ref allele*: 54%Target region

SNP(outside target region)

Fragment duplication – revisited

Reference allele bias

(*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346

ref allele*:54%

non-ref allele*: 45%

SNP calling findings

BCM/454 BI/SLX WUGSC/SLX SC/SLX# Samples 32 23 16 11

<read depth> per sample 35 X 62 X 117 X 51 X

# SNPs called 7,200 – 8,400 4,500 – 4,700 3,700 3,500 – 3,700

% dbSNPs 39 - 55 65 - 72 68 75 - 85

Ts/Tv(#SNP) 1.7 – 2.6 1.9 – 2.3 2.3 2.5 – 2.6

# Novel SNPs 3,998 1,550 1,947 892

• based on a method comparison / testing exercise

• 80 samples drawn from the 4 Centers

• read mapping / SNP calling by the Baylor pipeline (BCM/454 data); the Broad and the BC pipelines (all 80 samples)

Overlap between call sets

45217238.05%1.21

2,2961,86281.10%3.40

413245.81%0.23

Broad callsBC calls

# SNP calls:# dbSNPs:% dbSNPs:Ts/Tv ratio:

The 1000G Structural Variation Discovery Effort

Structural variation detection

Feuk et al. Nature Reviews Genetics, 2006

SV detection – resolution

Expected CNVsKaryotype

Micro-arraySequencing

CNV event length [bp]

Detection Approaches

• Read Depth: good for big CNVs

Sample Reference

contig

• Paired-end: all types of SV

• Split-Readsgood break-point resolution

• deNovo Assembly~ the future

SV slides courtesy of Chip Stewart, Boston College

Read depth (RD)

Statistical & systematic biases

Single molecule sequencing?

GC Bias

Coverage bias

RD resolution

ity (l

expected read counts ( per kb)

CN = 2

CN = 1

CN = 3

CNV events detected with RD

Chromosome 2 Position [Mb]

Helicos

40 kb deletion 454

Illumina

individual “NA12878”

SV detection with PE read map positions

Deletion

Individual Reference

LMLdel

Tandemduplication Ldup

InversionLinv

Fragment length distributions

• long fragments ~ better fragment coverage and sensitivity to large events (454)• tighter distributions ~ better breakpoint resolution and sensitivity for shorter events (Illumina)

The SV/CNV event display

chromosomeoverview

fragment lengths

read depth

eventtrack

300 bp deletion in chromosome of NA12878 by Illumina paired-end data from the 1000 Genomes project

Chip Stewart

Deletion event lengths

Mobile element insertions: PE reads

• Used with short-read data (Illumina, in our case)• Detect clusters of 5’ & 3’ read pairs with one end mapping

to a mobile element• Clusters far from annotated elements are candidate

insertion events

ALU / L1 / SVA5’ end 3’ end

Mobile element insertions: Split reads

• Requires longer reads (454)• Reads “mapping into” mobile element not present in the

reference genome sequence are candidate insertion events

Mobile element insertions: trio members

356185 182

NA12891684

NA12892664

ALU insertions

NA12878733

Detection in 1000 G Pilot 2 CEU trio PE Illumina data

Mobile element insertions: PE vs. Split-reads

569247 163454 Split-Read817

SLX PE733

ALU insertions in NA12878

BC event lists in 1000 Genomes data

SV typePilot 1140

samples low coverage

Pilot 26 samples

high coverage

deletions 5,555 4,718tandem

duplications 540 406mobile

element insertions

3,276 2,013

SV calls / validation in 1000G datasets

Software access

http://bioinformatics.bc.edu/marthlab/Software_Release

Credits

Elaine Mardis

Andy Clark

Aravinda Chakravarti

Michael Egholm

Scott Kahn

Francisco de la Vega

Patrice MilosJohn Thompson

R01 HG004719R01 HG003698R21 AI081220RC2 HG5552

Several positions are available:grad students / postodocs /

programmers

Can we find exon CNVs in Pilot 3 data?

SVs in exon sequencing data

Next-generation sequencing: informatics & software aspects

Documents

Sequencing techniques and genome assembly Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2011

School of Informatics CG087 Time-based Multimedia Assets Sampling & SequencingDr Paul Vickers1 Sampling & Sequencing Combining MIDI and audio

Clinical Research Informatics - CDM Media · Clinical Research Informatics Biomedical Informatics Clinical Informatics Bio (Molecular) Informatics Nursing Informatics Dental Informatics

Bio-IT World & Cambridge Healthtech Institute’s Fifth ... · Genome Informatics Sequencing Informatics Europe Clinical Genomics ... Challenging translational Bottlenecks Manon van

Sequencing Informatics Gabor T. Marth Department of Biology, Boston College marth@bc.edu BI420 – Introduction to Bioinformatics

Next Generation Sequencing: A Policy Perspectivedownload2.eurordis.org/ecrd2012/T4S0404_K_DEVRIENDT.pdfProf. Jan Aerts (bio-informatics) Prof. Pascal Borry (medicalethics) Prof. Koen

Technical Aspects of Multimodal System University of ... · 2 Technical Aspects of Multimodal System Dept. Informatics, Faculty of Mathematics, Informatics and Natural Sciences University

Proteomics Informatics – Protein identification III: de novo sequencing (Week 6)

Technical Aspects of Multimodal System University of ... · Technical Aspects of Multimodal System Dept. Informatics, Faculty of Mathematics, Informatics and Natural Sciences University

Next-generation sequencing – the informatics angle Gabor T. Marth Boston College Biology Department AGBT 2008 Marco Island, FL. February 6. 2008

Master Thesis in Informatics What aspects are …DB...Master Thesis in Informatics What aspects are important in an ex-post investment evaluation? Dzenana Begovac Sandra Fagerström

Sequence Assembly and Next Generation …compbio.ucdenver.edu/7711 2013/Leach_CPBS7711_09172011...Sequence Assembly and Next Generation Sequencing Informatics CPBS7711 Sept 17, 2013

Informatics challenges and computer tools for sequencing 1000s of human genomes Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory

IT Law in the Framework of Legal Informatics · IT law forms a natural part of legal informatics and how the theoretical ... discussions of legal aspects of information technology

New Aspects ofwseas.us/books/2010/Catania/TELE-INFO.pdf · 2010. 6. 4. · New Aspects of TELECOMMUNICATIONS and INFORMATICS 9th WSEAS International Conference on TELECOMMUNICATIONS

Massive Parallel Sequencing Kun Huang, PhD Department of Biomedical Informatics OSU CCC Bioinformatics Shared Resources

Pathogen Informatics 21 st Nov 2014 Pathogen Sequencing Informatics Jacqui Keane Pathogen Informatics

Decoupling Aspects in Board Games Modeling · Decoupling Aspects in Board Games Modeling Fulvio Frapolli, Amos Brocco , Apostolos Malatras and B´eat Hirsbrunner Department of Informatics,

BASIC PROCEDURES SAMPLING, MATERIAL, SANGER SEQUENCING · 2019-12-12 · Technical aspects of TP53 mutation analysis: BASIC PROCEDURES – SAMPLING, MATERIAL, SANGER SEQUENCING Sarka

Proteomics Informatics – Protein identification III: de novo sequencing (Week 6)