Informatics for next-generation sequence analysis – SNP calling Gabor T. Marth Boston College...

Informatics for next-generation sequence analysis – SNP calling

Gabor T. Marth, Boston College Biology Department

PSB 2008, January 4-8, 2008

Read length and throughput

[Figure: read length (x-axis: 10 bp, 100 bp, 1,000 bp) versus bases per machine run (y-axis: 1 Mb, 10 Mb, 100 Mb, 1 Gb)]

• Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads)

• 454 pyrosequencer (20-100 Mb in 100-250 bp reads)

• ABI capillary sequencer

Current and future application areas

• Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery

• De novo genome sequencing

• Short-read sequencing will be (at least) an alternative to micro-arrays for:

  • DNA-protein interaction analysis (ChIP-Seq)
  • novel transcript discovery
  • quantification of gene expression
  • epigenetic analysis (methylation profiling)

[Figure: reads aligned to a reference genome, with a deletion (DEL) and a SNP marked]

Fundamental informatics challenges (I)

1. Interpreting machine readouts – base calling, base error estimation

2. Dealing with non-uniqueness in the genome: resequenceability

3. Alignment of billions of reads

Informatics challenges (II)

4. SNP, short INDEL, and structural variation discovery

5. Data visualization

6. Data storage & management

Resequencing-based SNP discovery

genome reference sequence

Read mapping

Read alignment

Paralog identification

SNP detection + inspection

SNP calling workflow

• read alignment

• SNP detection

• visual checking

Bayesian detection algorithm

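$$
P(\mathrm{SNP}) =
\frac{\displaystyle \sum_{\text{all variable } S}
      \frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots
      \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \,
      P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
     {\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
      \frac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots
      \frac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \,
      P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$

[Figure: example columns of aligned bases (A, C, G, T) illustrating a monomorphic combination and a polymorphic combination]

• P(SNP): Bayesian posterior probability, i.e. the SNP score

• P(S_i | R_i): base call + base quality

• P_Prior(S_1, ..., S_N): polymorphism rate (prior)

• P_Prior(S_i): base composition

• N: depth of coverage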
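The Bayesian SNP score on this slide can be illustrated with a brute-force sketch that enumerates every combination of true bases across the reads. This is not the authors' software: the function names (`call_probs`, `snp_probability`) are hypothetical, and the joint prior is a deliberate simplification that spreads the polymorphism rate evenly over all polymorphic combinations and weights monomorphic combinations by base composition. Enumeration is exponential in the number of reads, so it only suits small toy examples.

```python
from itertools import product

BASES = "ACGT"

def call_probs(called_base, p_err):
    """P(S_i | R_i) for one read. The called base gets 1 - p_err; the error
    mass is split evenly over the other three bases (assumption)."""
    return {b: 1.0 - p_err if b == called_base else p_err / 3.0 for b in BASES}

def snp_probability(read_probs, poly_rate=0.001, base_freq=None):
    """Posterior probability that the site is polymorphic.

    read_probs: one dict per read mapping base -> P(S_i | R_i).
    poly_rate:  prior probability that a site is polymorphic (assumed value).
    base_freq:  prior base composition; uniform if not given.
    """
    if base_freq is None:
        base_freq = {b: 0.25 for b in BASES}
    n = len(read_probs)
    n_poly = 4 ** n - 4          # number of polymorphic combinations
    poly_sum = 0.0               # numerator: polymorphic combinations only
    total = 0.0                  # denominator: all combinations
    for combo in product(BASES, repeat=n):
        weight = 1.0
        for probs, s in zip(read_probs, combo):
            weight *= probs[s] / base_freq[s]
        if len(set(combo)) == 1:
            # monomorphic combination, weighted by base composition
            weight *= (1.0 - poly_rate) * base_freq[combo[0]]
        else:
            # simplified joint prior: uniform over polymorphic combinations
            weight *= poly_rate / n_poly
            poly_sum += weight
        total += weight
    return poly_sum / total

# Four confident reads that agree, versus two A and two C reads,
# versus a single low-quality mismatch.
p_same = snp_probability([call_probs("A", 0.001)] * 4)
p_mixed = snp_probability([call_probs("A", 0.001)] * 2 + [call_probs("C", 0.001)] * 2)
p_noisy = snp_probability([call_probs("A", 0.001)] * 3 + [call_probs("C", 0.1)])
print(p_same, p_mixed, p_noisy)
```

With these toy inputs the mixed column scores close to 1, while the agreeing column and the column with one low-quality mismatch score close to 0, which is the behaviour the slide attributes to combining base quality, prior, and coverage.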

Base quality values for SNP calling

• base quality values help us decide if mismatches are true polymorphisms or sequencing errors

• accurate base qualities are crucial, especially in lower coverage
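As a small illustration of how a base quality value feeds into this decision, the sketch below converts a Phred-scaled quality Q into an error probability using the standard relation p = 10^(-Q/10), then builds a per-read base probability table. The helper names are hypothetical, and splitting the error probability evenly over the three uncalled bases is an assumption of the sketch, not something stated on the slide.

```python
BASES = "ACGT"

def phred_to_error(q):
    """Standard Phred relation: error probability = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def base_probabilities(called_base, q):
    """Probability of each true base given one base call and its quality.
    The error mass is split evenly over the other bases (assumption)."""
    p_err = phred_to_error(q)
    return {b: 1.0 - p_err if b == called_base else p_err / 3.0 for b in BASES}

# A mismatching C called at Q10 is far weaker evidence than one called at Q30.
weak = base_probabilities("C", 10)
strong = base_probabilities("C", 30)
print(weak["C"], strong["C"])
```

At low coverage there are few reads to outvote a miscalled base, so an inflated quality value translates almost directly into a false SNP call.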

Priors for specific resequencing scenarios

[Figure: aligned reads AACGTTAGCATA and AACGTTCGCATA, which differ at a single A/C site, shown for three strains (strain 1: A reads only, strain 2: C reads only, strain 3: A reads only) and for three individuals (one with both A and C reads, one with C reads only, one with A reads only)]

Consensus sequence generation (genotyping)

[Figure: aligned reads AACGTTAGCATA and AACGTTCGCATA from three strains and three individuals, with consensus calls A, C, A for strains 1, 2, 3 and genotype calls A/C, C/C, A/A for the individuals]

SNP calling in Roche/454 pyrosequences

SNP calling in low 454 coverage

• with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)

• 10 different African and American D. melanogaster isolates

• 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)

• can we detect SNPs in survey-style 454 read coverage?

DNA courtesy of Chuck Langley, UC Davis

iso-1 reference

46-2 454 read

46-2 ABI reads (2 fwd + 2 rev)

• 92.9% validation rate (1,342 / 1,443)

• 2.0% missed SNP rate (25 / 1,247)

SNP calling in Illumina/Solexa short-reads

SNP calling in short-read coverage

C. elegans reference genome (Bristol, N2 strain)

Pasadena, CB4858 (1 ½ machine runs)

• SNP calling error rate very low:

Validation rate = 97.8% (224/229)
Conversion rate = 92.6% (224/242)
Missed SNP rate = 3.75% (26/693)

[Figure: example SNP and insertion (INS) calls]

• INDEL candidates validate and convert at similar rates to SNPs:

Validation rate = 89.3% (193/216)
Conversion rate = 87.3% (193/221)

SNP calling in AB/SOLiD color-space reads

[Figure: aligned copies of the sequence ACGGTCGTCGTGTGCGT, one of them carrying a T>C substitution (ACGGTCGCCGTGTGCGT); panels labeled "No change", "SNP", and "Measurement error"]
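The distinction between a SNP and a measurement error in color space can be sketched with the standard two-base encoding, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). A single base substitution then changes two adjacent colors while leaving their XOR unchanged, whereas an isolated color change is more likely a measurement error. The function names and the simplified classification rule are assumptions of this sketch; it ignores substitutions at the read ends and multi-base variants.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    """Two-base encoding: one color per adjacent base pair (XOR of codes)."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def classify(ref_colors, read_colors):
    """Simplified rule: no differences, two adjacent consistent differences
    (a SNP), or one isolated difference (a measurement error)."""
    diffs = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]
    if not diffs:
        return "no change"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # A real substitution preserves the XOR of the two affected colors.
        if ref_colors[i] ^ ref_colors[j] == read_colors[i] ^ read_colors[j]:
            return "SNP"
    if len(diffs) == 1:
        return "measurement error"
    return "other"

ref = "ACGGTCGTCGTGTGCGT"
snp = "ACGGTCGCCGTGTGCGT"        # T>C substitution from the slide's sequences

ref_colors = to_colors(ref)
snp_colors = to_colors(snp)

# Simulate a single miscalled color (hypothetical error, not from the slide).
err_colors = list(ref_colors)
err_colors[6] = (err_colors[6] + 1) % 4

print(classify(ref_colors, ref_colors))
print(classify(ref_colors, snp_colors))
print(classify(ref_colors, err_colors))
```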

Mutational profiling: deep 454/Illumina/SOLiD data

• collaboration with Doug Smith at Agencourt

• Pichia stipitis converts xylose to ethanol (bio-fuel production)

• one mutagenized strain had especially high conversion efficiency

• goal: determine which mutations caused this phenotype

• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads

• 14 true point mutations in the entire genome

• at about 15X nominal coverage, each technology can find every point mutation with essentially no false positives

Pichia stipitis reference sequence

Image from JGI web site

Our software is available for testing

http://bioinformatics.bc.edu/marthlab/Beta_Release

Credits

http://bioinformatics.bc.edu/marthlab

Elaine Mardis (Washington University)
Andy Clark (Cornell University)
Doug Smith (Agencourt)

Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)

Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby
