A Novel Approach To Diploid Base Calling

Aaron R. Quinlan (quinlaaa@bc.edu) and Gabor T. Marth (marth@bc.edu), Department of Biology, Boston College, Chestnut Hill, MA 02467

http://bioinformatics.bc.edu/marthlab/

Our SNP detection method detects SNPs across clonal reads based on base composition and quality.

[Figure: C/T site; P(TT|R) = .9991, P(CT|R) = .96]

Method for Diploid Base Calling (Support Vector Machine-based)

Collect heterozygous and homozygous training examples.

Calculate indicative features that separate heterozygotes from homozygotes.

SNP 1, SNP 2, … SNP N

The trained SVM can separate unseen homozygotes and heterozygotes.

Make diploid base calls on unseen alignments.
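The train-then-call workflow can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the poster does not name its features, kernel, or SVM library, so the two features (a secondary-peak ratio and a quality drop), the toy training values, and the plain linear SVM trained by sub-gradient descent on the hinge loss are all hypothetical stand-ins.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(examples, labels, learning_rate=0.1, reg=0.01, epochs=500):
    """Sub-gradient descent on the L2-regularized hinge loss (linear SVM)."""
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            violated = y * (dot(w, x) + b) < 1  # inside the margin or misclassified
            w = [wj * (1 - learning_rate * reg) for wj in w]  # regularization shrink
            if violated:
                w = [wj + learning_rate * y * xj for wj, xj in zip(w, x)]
                b += learning_rate * y
    return w, b


def svm_score(w, b, x):
    """Signed score: positive suggests a heterozygote, negative a homozygote."""
    return dot(w, x) + b


def call_zygosity(w, b, x):
    return "het" if svm_score(w, b, x) > 0 else "hom"


# Hypothetical features per read position: (secondary_peak_ratio, quality_drop).
training_examples = [
    (0.90, 0.70), (0.80, 0.50), (0.70, 0.60), (0.85, 0.65),  # heterozygotes
    (0.10, 0.10), (0.05, 0.20), (0.15, 0.05), (0.20, 0.10),  # homozygotes
]
training_labels = [1, 1, 1, 1, -1, -1, -1, -1]

weights, bias = train_linear_svm(training_examples, training_labels)
print(call_zygosity(weights, bias, (0.75, 0.55)))
print(call_zygosity(weights, bias, (0.125, 0.15)))
```

A production system would more likely rely on an established SVM library and a richer feature set; the sketch only shows how labeled examples become a scoring function for unseen positions.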

P(CT|R) = .34

P(CT|R) = .01

P(AC|R) = .999

P(AT|R) = .001

P(S1S2|R) = Probability of allelic combination given the read

SVM

SVMs Learn a Function to Distinguish between Positive and Negative based on the statistics of the features in the training examples.

We are integrating diploid base calling (heterozygote detection) into

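\[
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
\]

Inputs annotated on the formula: Base Call/Quality, Polymorphism Rate, Base Composition, Depth of Alignment.

Output: Probability of polymorphism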
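The probability-of-polymorphism calculation can be sketched by brute-force enumeration over every combination of bases across the aligned reads. This is an illustration under stated assumptions, not the authors' implementation: the function name `p_snp`, the uniform base composition, and in particular the way the joint prior spreads the polymorphism rate evenly over all non-identical combinations are hypothetical choices.

```python
from itertools import product

BASES = "ACGT"


def p_snp(read_posteriors, base_freq, poly_rate):
    """Sum the normalized weight of every base combination that is not monomorphic.

    read_posteriors: one dict per read mapping each base to P(S | R).
    base_freq:       per-read prior P_Prior(S), i.e. base composition.
    poly_rate:       assumed prior probability that the site is polymorphic.
    """
    n = len(read_posteriors)
    if n < 2:
        raise ValueError("need at least two reads to observe a polymorphism")
    n_polymorphic = 4 ** n - 4  # combinations in which not all bases agree
    total = 0.0
    variable = 0.0
    for combo in product(BASES, repeat=n):
        weight = 1.0
        for base, posterior in zip(combo, read_posteriors):
            weight *= posterior[base] / base_freq[base]
        if len(set(combo)) == 1:
            # Assumed joint prior for a monomorphic combination.
            weight *= (1.0 - poly_rate) * base_freq[combo[0]]
        else:
            # Assumed joint prior: polymorphism rate spread evenly.
            weight *= poly_rate / n_polymorphic
            variable += weight
        total += weight
    return variable / total


def confident_call(base, p=0.9999):
    """Hypothetical read-level posterior strongly favoring one base."""
    rest = (1.0 - p) / 3
    return {b: (p if b == base else rest) for b in BASES}


uniform = {b: 0.25 for b in BASES}
concordant = p_snp([confident_call("A"), confident_call("A")], uniform, 0.001)
conflicting = p_snp([confident_call("A"), confident_call("G")], uniform, 0.001)
print(concordant, conflicting)
```

The enumeration grows as 4^N, so it is only practical for shallow alignments; it is meant to make the structure of the formula concrete rather than to be efficient.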

Assessing the Accuracy of the Initial Prototype:

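\[
\Pr(S_{11}, S_{12}, \ldots, S_{n1}, S_{n2} \mid R_1, \ldots, R_n) =
\frac{\dfrac{\Pr(S_{11}, S_{12} \mid R_1)}{\Pr_{\mathrm{prior}}(S_{11}, S_{12})} \cdots \dfrac{\Pr(S_{n1}, S_{n2} \mid R_n)}{\Pr_{\mathrm{prior}}(S_{n1}, S_{n2})} \, \Pr_{\mathrm{prior}}(S_{11}, S_{12}, \ldots, S_{n1}, S_{n2})}
{\displaystyle\sum_{\text{every } S_{i_{11}}, S_{i_{12}}, \ldots, S_{i_{n1}}, S_{i_{n2}}} \dfrac{\Pr(S_{i_{11}}, S_{i_{12}} \mid R_1)}{\Pr_{\mathrm{prior}}(S_{i_{11}}, S_{i_{12}})} \cdots \dfrac{\Pr(S_{i_{n1}}, S_{i_{n2}} \mid R_n)}{\Pr_{\mathrm{prior}}(S_{i_{n1}}, S_{i_{n2}})} \, \Pr_{\mathrm{prior}}(S_{i_{11}}, S_{i_{12}}, \ldots, S_{i_{n1}}, S_{i_{n2}})}
\]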

“Unseen” Alignments

SNP (A/G) found across multiple clonal reads

PCR-based sequences of diploid individuals

Calls = (CC, CT, TT)

P(CC|R) = .9995

P(CT|R) = .003

Summary:

1. We built a diploid base calling prototype from the ground up. The initial prototype’s performance is similar to Polyphred 5.

2. We are currently compiling a larger example set to improve accuracy.

3. Our method incorporates information from multiple reads for a given individual in a statistically rigorous fashion.

4. This prototype represents the first major expansion of .

5. We are currently working to expand the prototype to a production-ready application.

Probability of each possible diploid base call (AA, CC, GG, TT, AC, AG, AT, CG, CT, GT)

Each Possible Diploid Base Call/Probability

Prior Probability of Each Diploid Genotype

Depth of Alignment

Observed Diploid Variations/Probabilities

[Figure: P(Het) as a function of SVM score; a positive score (+) is Het, a negative score (-) is Hom]

Utilizing multiple reads per individual, we can make an individual genotype call.

Forward Read and Reverse Read

P(GT | Read) = .98

P(GT | Read) = .87

Individual Genotype Call: P(GT) = .993

Prior(GT Frequency) = .34?

Rationale: The accuracy of the consensus diploid base call for an individual increases with the number of reads available for that individual.
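To make the rationale concrete, the sketch below combines per-read genotype probabilities for one individual into a single call. It assumes the reads are conditionally independent given the genotype and divides out the prior used at the read level, in the spirit of the diploid formula; the function name, the uniform read-level prior, and all probabilities other than the .98, .87, and .34 shown on the poster are hypothetical, so the result is not expected to reproduce the .993 figure exactly.

```python
def combine_read_calls(read_posteriors, read_prior, genotype_prior):
    """Combine per-read genotype posteriors into one posterior per individual.

    read_posteriors: one dict per read mapping genotype -> P(genotype | read).
    read_prior:      prior assumed when each read-level posterior was computed.
    genotype_prior:  prior genotype frequencies for the individual-level call.
    """
    unnormalized = {}
    for genotype, prior in genotype_prior.items():
        weight = prior
        for posterior in read_posteriors:
            weight *= posterior[genotype] / read_prior[genotype]
        unnormalized[genotype] = weight
    total = sum(unnormalized.values())
    return {genotype: weight / total for genotype, weight in unnormalized.items()}


# Only three genotypes are modeled to keep the example small (an assumption).
forward_read = {"GT": 0.98, "GG": 0.01, "TT": 0.01}
reverse_read = {"GT": 0.87, "GG": 0.065, "TT": 0.065}
read_level_prior = {"GT": 1 / 3, "GG": 1 / 3, "TT": 1 / 3}
genotype_frequency = {"GT": 0.34, "GG": 0.33, "TT": 0.33}

individual_call = combine_read_calls(
    [forward_read, reverse_read], read_level_prior, genotype_frequency
)
print(individual_call)
```

Because each agreeing read multiplies the evidence for the same genotype, the combined probability for GT ends up higher than either read alone supports, which is the behavior the rationale describes.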

Polyphred 5 was tested with the following settings: quality = 21, score = 99, source, ref_comp


Convert SVM Score to P(Het)
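One common way to turn a signed SVM score into a probability is a logistic (Platt-style) mapping. The poster does not say which conversion was used, so the sketch below is an assumption, and the `slope` and `offset` parameters are hypothetical values that would normally be fitted on held-out examples.

```python
import math


def svm_score_to_p_het(score, slope=2.0, offset=0.0):
    """Map a signed SVM score to P(Het) with a logistic curve.

    Positive scores (het side) map above 0.5, negative scores (hom side) below.
    The two branches keep the computation stable for large |score|.
    """
    z = slope * score + offset
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


for example_score in (-2.0, 0.0, 2.0):
    print(example_score, svm_score_to_p_het(example_score))
```

With `offset=0.0` a score of exactly zero maps to 0.5, matching the convention that the sign of the score separates heterozygotes from homozygotes.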

Assessing the Genotyping Accuracy of the Initial Prototype

Objective: To enhance with an accurate diploid base calling algorithm

[Figure: Accuracy. True positive rate (0 to 1) by P(Het) range: 0-.1, .1-.2, .2-.3, .3-.4, .4-.5, .5-.6, .6-.7, .7-.8, .8-.9, .9-.95, .95-.99, .99-1]
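An accuracy-by-range summary of this kind can be computed by binning calls on their P(Het) value and measuring the fraction of true heterozygotes in each bin. The sketch below uses the same bin edges as the chart; the function name and the toy calls are hypothetical, and inputs are assumed to lie between 0 and 1.

```python
from bisect import bisect_right

# Bin edges matching the P(Het) ranges on the chart.
P_HET_EDGES = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0]


def accuracy_by_p_het_bin(calls):
    """Return (range label, number of calls, fraction of true hets) per bin.

    calls: iterable of (p_het, is_true_het) pairs with p_het in [0, 1].
    The fraction is None for a bin that received no calls.
    """
    n_bins = len(P_HET_EDGES) - 1
    counts = [0] * n_bins
    true_hets = [0] * n_bins
    for p_het, is_true_het in calls:
        # A value of exactly 1.0 is placed in the last bin.
        index = min(bisect_right(P_HET_EDGES, p_het) - 1, n_bins - 1)
        counts[index] += 1
        true_hets[index] += int(is_true_het)
    rows = []
    for i in range(n_bins):
        label = f"{P_HET_EDGES[i]:g}-{P_HET_EDGES[i + 1]:g}"
        fraction = true_hets[i] / counts[i] if counts[i] else None
        rows.append((label, counts[i], fraction))
    return rows


# Toy calls for illustration only; these are not data from the poster.
toy_calls = [(0.05, False), (0.97, True), (0.97, False), (0.995, True), (1.0, True)]
rows = accuracy_by_p_het_bin(toy_calls)
for row in rows:
    print(row)
```

The per-bin counts returned alongside each fraction also give the number of calls made in each P(Het) range.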


Probability of each genotype from a diploid base call:

P(CT) = .9

P(CC) = .045

P(TT) = .045

P(Others) = .01

[Figure: Sensitivity. Fraction of callable hets found (0 to 1) by P(Het) limit: <.1, <.2, <.3, <.4, <.5, <.6, <.7, <.8, <.9, <.95, <.99, <1]

[Figure: Number of Calls within Posterior Probability Range. Number of calls made (0 to 2000) by P(Het) range]

21851 Data Accuracy by P(Het) Score

Number of Alignments Analyzed: 993

Total Number of Read Positions: 231874

Total Number of Heterozygotes: 31411

Total Number of Homozygotes: 143370

Note: Polyphred was tested on alignments created by PolyBayes. This allowed Polyphred to analyze a larger fraction of reads than it could with Phrap alignments.
