
A Novel Approach To Diploid Base Calling

Aaron R. Quinlan (quinlaaa@bc) and Gabor T. Marth (marth@bc), Department of Biology, Boston College, Chestnut Hill, MA 02467

http://bioinformatics.bc.edu/marthlab/

Our SNP detection method detects SNPs across clonal reads based on base composition and quality.

[Figure: example C/T site with read-level calls P(TT|R) = .9991 and P(CT|R) = .96]

Method for Diploid Base Calling (Support Vector Machine-based)

Collect heterozygous and homozygous training examples (SNP 1, SNP 2, …, SNP N).

Calculate indicative features that separate heterozygotes from homozygotes.

The trained SVM can separate unseen homozygotes and heterozygotes.

Make diploid base calls on unseen alignments.

Example calls: P(CT|R) = .34, P(CT|R) = .01, P(AC|R) = .999, P(AT|R) = .001

P(S1S2|R) = probability of the allelic combination S1S2 given the read R

SVMs learn a function to distinguish between positive and negative examples based on the statistics of the features in the training examples.
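The training step above can be illustrated with a toy linear SVM. This is only a minimal sketch: the two feature values per example, the learning rate, the regularization strength, and the function names `train_linear_svm` and `svm_score` are all hypothetical, and the poster does not say which SVM implementation or kernel was used.

```python
# Toy linear SVM trained by sub-gradient descent on the hinge loss.
# Feature values and hyperparameters are hypothetical, not from the poster.

def svm_score(w, b, x):
    """Signed score: positive suggests heterozygote, negative homozygote."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(examples, labels, lr=0.1, lam=0.001, epochs=500):
    """Return (weights, bias); labels are +1 (het) or -1 (hom)."""
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            score = svm_score(w, b, x)
            # L2 shrinkage on the weights (the regularization term).
            w = [wi * (1.0 - lr * lam) for wi in w]
            if y * score < 1.0:
                # Margin violated: step along the hinge-loss sub-gradient.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Hypothetical two-feature training examples.
hets = [(0.90, 0.80), (0.80, 0.90), (0.85, 0.70)]
homs = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.30)]
weights, bias = train_linear_svm(hets + homs, [1] * len(hets) + [-1] * len(homs))

# Score two unseen feature vectors.
unseen_het_score = svm_score(weights, bias, (0.85, 0.80))
unseen_hom_score = svm_score(weights, bias, (0.15, 0.20))
```

A positive score corresponds to "+ is Het" and a negative score to "- is Hom" in the poster's notation.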

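We are integrating diploid base calling (heterozygote detection) into our SNP detection method.

$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1,\ldots,S_N)}
{\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1},\ldots,S_{i_N})}
$$

Inputs: base call/quality, polymorphism rate, base composition, depth of alignment.

Output: probability of polymorphism.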
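To make the sum over base configurations concrete, here is a minimal sketch of the P(SNP) calculation for a handful of clonal reads. The joint prior used here (mass 1 - theta shared by the four monomorphic configurations, mass theta shared by all variable ones), the uniform base composition of 0.25, and the helper names are assumptions made for illustration; the poster does not specify them.

```python
from itertools import product

BASES = "ACGT"

def joint_prior(config, theta):
    """Hypothetical joint prior over one base per read."""
    if len(set(config)) == 1:
        return (1.0 - theta) / 4            # monomorphic configuration
    return theta / (4 ** len(config) - 4)   # variable configuration

def prob_snp(read_posteriors, theta=0.001, base_prior=0.25):
    """read_posteriors: one dict per read mapping base -> P(S_i | R_i)."""
    variable_mass = 0.0
    total_mass = 0.0
    for config in product(BASES, repeat=len(read_posteriors)):
        term = joint_prior(config, theta)
        for base, posterior in zip(config, read_posteriors):
            term *= posterior[base] / base_prior
        total_mass += term
        if len(set(config)) > 1:
            variable_mass += term
    return variable_mass / total_mass

def confident_read(base, error=1e-4):
    """Hypothetical read: called base with probability 1 - error."""
    return {b: (1.0 - error if b == base else error / 3) for b in BASES}

p_agree = prob_snp([confident_read("A"), confident_read("A")])
p_disagree = prob_snp([confident_read("A"), confident_read("G")])
```

The enumeration grows as 4^N, so this form is only practical for shallow alignments; it is meant to mirror the formula term by term rather than to be efficient.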

Assessing the Accuracy of the Initial Prototype:

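$$
\Pr(S_{11}S_{12},\ldots,S_{n1}S_{n2} \mid R_1,\ldots,R_n) =
\frac{\dfrac{\Pr(S_{11},S_{12} \mid R_1)}{\Pr_{\mathrm{prior}}(S_{11},S_{12})} \cdots \dfrac{\Pr(S_{n1},S_{n2} \mid R_n)}{\Pr_{\mathrm{prior}}(S_{n1},S_{n2})} \, \Pr_{\mathrm{prior}}(S_{11},S_{12},\ldots,S_{n1},S_{n2})}
{\displaystyle\sum_{\text{every } S_{i_1 1},S_{i_1 2},\ldots,S_{i_n 1},S_{i_n 2}} \dfrac{\Pr(S_{i_1 1},S_{i_1 2} \mid R_1)}{\Pr_{\mathrm{prior}}(S_{i_1 1},S_{i_1 2})} \cdots \dfrac{\Pr(S_{i_n 1},S_{i_n 2} \mid R_n)}{\Pr_{\mathrm{prior}}(S_{i_n 1},S_{i_n 2})} \, \Pr_{\mathrm{prior}}(S_{i_1 1},S_{i_1 2},\ldots,S_{i_n 1},S_{i_n 2})}
$$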

“Unseen” alignments

SNP (A/G) found across multiple clonal reads

PCR-based sequences of diploid individuals

Calls = (CC, CT, TT)

P(CC|R) = .9995

P(CT|R) = .003

Summary:

1. We built a diploid base calling prototype from the ground up. The initial prototype’s performance is similar to that of Polyphred 5.

2. We are currently compiling a larger example set to improve accuracy.

3. Our method incorporates information from multiple reads for a given individual in a statistically rigorous fashion.

4. This prototype represents the first major expansion of our SNP detection method.

5. We are currently working to expand the prototype into a production-ready application.

Probability of each possible diploid base call (AA, CC, GG, TT, AC, AG, AT, CG, CT, GT)

Features:

Each possible diploid base call/probability

Prior probability of each diploid genotype

Depth of alignment

Observed diploid variations/probabilities

SVM score: + is Het, - is Hom

[Figure: P(Het), from 0 to 1, plotted against SVM score, from - through 0 to +]

Utilizing multiple reads per individual, we can make an individual genotype call.

Forward read, reverse read

P(GT | Read) = .98

P(GT | Read) = .87

Prior(GT Frequency) = .34

Individual Genotype Call: P(GT) = .993

Rationale: The accuracy of the consensus diploid base call for an individual increases with the number of reads available for that individual.
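The single-individual case can be sketched as follows: each read's genotype posterior is divided by the prior, the ratios are multiplied together with the prior, and the result is normalized over the ten diploid genotypes. The poster gives only the GT values, so spreading the remaining probability evenly over the other nine genotypes (the `spread` helper) is an assumption made here; for that reason the sketch is not expected to reproduce the poster's P(GT) = .993.

```python
GENOTYPES = ["AA", "CC", "GG", "TT", "AC", "AG", "AT", "CG", "CT", "GT"]

def spread(genotype, p):
    """Hypothetical distribution: p on one genotype, the rest shared evenly."""
    rest = (1.0 - p) / (len(GENOTYPES) - 1)
    return {g: (p if g == genotype else rest) for g in GENOTYPES}

def combine_reads(read_posteriors, prior):
    """Combine per-read genotype posteriors for one individual."""
    unnormalized = {}
    for g in GENOTYPES:
        term = prior[g]
        for posterior in read_posteriors:
            term *= posterior[g] / prior[g]
        unnormalized[g] = term
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

prior = spread("GT", 0.34)
reads = [spread("GT", 0.98), spread("GT", 0.87)]
call = combine_reads(reads, prior)
```

Under these assumptions the combined P(GT) exceeds both single-read values, which is the behavior the rationale describes.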

Polyphred 5 was tested with the following settings: quality = 21, score = 99, source, ref_comp


Convert SVM Score to P(Het)
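The poster does not state how the SVM score is mapped to P(Het). One common choice is a logistic (Platt-style) mapping, sketched below as an assumption; the slope `a` and offset `b` are hypothetical and would normally be fitted on held-out labeled examples.

```python
import math

def score_to_p_het(score, a=2.0, b=0.0):
    """Map a signed SVM score to a probability in (0, 1).

    a and b are hypothetical calibration parameters, not values from the poster.
    """
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

p_at_boundary = score_to_p_het(0.0)
p_positive = score_to_p_het(1.5)
p_negative = score_to_p_het(-1.5)
```

With a positive slope, scores on the Het side of the boundary map above 0.5 and scores on the Hom side map below it.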

Assessing the Genotyping Accuracy of the Initial Prototype

Objective: To enhance our SNP detection method with an accurate diploid base calling algorithm

[Figure: Accuracy. True positive rate (0 to 1) by P(Het) range: 0-.1, .1-.2, .2-.3, .3-.4, .4-.5, .5-.6, .6-.7, .7-.8, .8-.9, .9-.95, .95-.99, .99-1]
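A per-range accuracy tabulation of this kind can be sketched as below. It takes the plotted rate to be the fraction of calls in each P(Het) range that are confirmed heterozygotes, which is an assumption about how the poster defines it; the function name and the example calls are hypothetical.

```python
from bisect import bisect_right

# Upper edges of every range except the last, matching the figure's x-axis.
BIN_EDGES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
BIN_LABELS = ["0-.1", ".1-.2", ".2-.3", ".3-.4", ".4-.5", ".5-.6",
              ".6-.7", ".7-.8", ".8-.9", ".9-.95", ".95-.99", ".99-1"]

def accuracy_by_bin(calls):
    """calls: iterable of (p_het, is_true_het) pairs.

    Returns {range label: (number of calls, fraction that are true hets)}
    for every range that received at least one call.
    """
    counts = [0] * len(BIN_LABELS)
    hits = [0] * len(BIN_LABELS)
    for p_het, is_true_het in calls:
        i = bisect_right(BIN_EDGES, p_het)
        counts[i] += 1
        hits[i] += bool(is_true_het)
    return {label: (n, h / n)
            for label, n, h in zip(BIN_LABELS, counts, hits) if n}

# Hypothetical calls, not data from the poster.
example_calls = [(0.05, False), (0.97, True), (0.97, False), (0.995, True)]
table = accuracy_by_bin(example_calls)
```

The same counts also give the number of calls made within each posterior probability range.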


Probability of each genotype from a diploid base call:

P(CT) = .9

P(CC) = .045

P(TT) = .045

P(Others) = .01

[Figure: Sensitivity. Fraction of callable hets found (0 to 1) by P(Het) limit: <.1, <.2, <.3, <.4, <.5, <.6, <.7, <.8, <.9, <.95, <.99, <1]

[Figure: Number of calls within posterior probability range. Number of calls made (axis 0 to 2000, one bar labeled 21851) by P(Het) range]

Accuracy by P(Het) Score

Data

Number of Alignments Analyzed: 993

Total Number of Read Positions: 231874

Total Number of Heterozygotes: 31411

Total Number of Homozygotes: 143370

Note: Polyphred was tested on alignments created by PolyBayes. This allowed Polyphred to analyze a larger fraction of reads, as compared to Phrap alignments.