A Novel Approach To Diploid Base Calling
Aaron R. Quinlan and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA 02467
http://bioinformatics.bc.edu/marthlab/
Our SNP detection method detects SNPs across clonal reads based on base composition and quality.
Figure labels: C/T site; P(TT|R) = .9991; P(CT|R) = .96
Method for Diploid Base Calling (Support Vector Machine-based)
Collect heterozygous and homozygous training examples (SNP 1, SNP 2, …, SNP N).
Calculate indicative features that separate heterozygotes from homozygotes.
The trained SVM can separate unseen homozygotes and heterozygotes.
Make diploid base calls on unseen alignments.
P(CT|R) = .34
P(CT|R) = .01
P(AC|R) = .999
P(AT|R) = .001
P(S1S2|R) = probability of the allelic combination S1S2 given the read R
SVMs learn a function that distinguishes positive from negative examples based on the statistics of the features in the training examples.
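To make the idea concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss. It is not the prototype's implementation: the two features (a secondary-peak ratio and a quality drop), the toy values, and the function names `train_linear_svm` and `decision_value` are all hypothetical, chosen only to show how a separating function is learned from labeled heterozygote and homozygote examples.

```python
def decision_value(w, b, x):
    """Signed score: positive means heterozygote, negative means homozygote."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(examples, labels, lam=0.01, eta=0.1, epochs=500):
    """Minimize hinge loss plus an L2 penalty with fixed-step sub-gradient descent.

    examples: list of feature vectors; labels: +1 (het) or -1 (hom).
    Returns (weights, bias).
    """
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            violated = y * decision_value(w, b, x) < 1.0
            # Gradient step on the L2 penalty shrinks the weights slightly.
            w = [wj * (1.0 - eta * lam) for wj in w]
            if violated:
                # Example lies inside the margin: move the boundary away from it.
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b


# Hypothetical features per position: [secondary-peak ratio, quality drop].
hets = [[0.8, 0.6], [0.9, 0.7], [0.7, 0.8]]
homs = [[0.1, 0.1], [0.2, 0.05], [0.05, 0.2]]
examples = [x for pair in zip(hets, homs) for x in pair]  # interleave classes
labels = [1, -1] * 3

w, b = train_linear_svm(examples, labels)
for x in ([0.85, 0.75], [0.1, 0.15]):  # unseen positions
    score = decision_value(w, b, x)
    print(x, "het" if score > 0 else "hom", round(score, 3))
```

The sign of the score gives the class, and its magnitude indicates how far the example lies from the decision boundary.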
We are integrating diploid base calling (heterozygote detection) into
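$$
P(SNP) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1|R_1)}{P_{Prior}(S_1)} \cdots \dfrac{P(S_N|R_N)}{P_{Prior}(S_N)} \, P_{Prior}(S_1,\ldots,S_N)}
{\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1}|R_1)}{P_{Prior}(S_{i_1})} \cdots \dfrac{P(S_{i_N}|R_N)}{P_{Prior}(S_{i_N})} \, P_{Prior}(S_{i_1},\ldots,S_{i_N})}
$$
Inputs: base call/quality, P(S_i|R_i); base composition, P_Prior(S_i); polymorphism rate, P_Prior(S_1,...,S_N); depth of alignment, N.
Output: probability of polymorphism, P(SNP).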
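The probability of polymorphism can be sketched in code as a sum over every assignment of true bases to the N reads, normalized by the total, keeping only the variable assignments in the numerator. This is an illustrative sketch rather than the tool's actual code. The Phred-based per-read posterior, the uniform base composition of 0.25, and the way `joint_prior` splits the polymorphism rate evenly across variable columns are simplifying assumptions, and all function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def base_posterior(called, quality):
    """P(S|R) for one read: the called base gets 1 - error, the rest share the error."""
    err = 10 ** (-quality / 10)  # Phred-scaled error probability
    return {b: (1 - err) if b == called else err / 3 for b in BASES}


def joint_prior(column, snp_rate):
    """Assumed P_Prior(S_1..S_N): monomorphic columns share 1 - snp_rate,
    variable columns share snp_rate, each group split evenly."""
    n_total = len(BASES) ** len(column)
    if len(set(column)) == 1:
        return (1 - snp_rate) / len(BASES)
    return snp_rate / (n_total - len(BASES))


def snp_probability(reads, snp_rate=0.001, base_prior=0.25):
    """reads: list of (called_base, phred_quality), one entry per aligned read."""
    posteriors = [base_posterior(c, q) for c, q in reads]
    total = 0.0
    variable = 0.0
    # Enumerate all 4^N assignments of true bases (fine for shallow columns only).
    for column in product(BASES, repeat=len(reads)):
        term = joint_prior(column, snp_rate)
        for s, post in zip(column, posteriors):
            term *= post[s] / base_prior
        total += term
        if len(set(column)) > 1:
            variable += term
    return variable / total


print(snp_probability([("C", 40), ("C", 40), ("T", 40), ("T", 40)]))
print(snp_probability([("C", 40)] * 4))
```

Two high-quality reads for each allele push the result close to 1, while four agreeing reads leave it close to 0. The enumeration grows as 4^N, so a real implementation would need a more efficient summation for deep alignments.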
Assessing the Accuracy of the Initial Prototype:
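$$
\Pr(S_{11}, S_{12}, \ldots, S_{n1}, S_{n2} \mid R_1, \ldots, R_n) =
\frac{\dfrac{\Pr(S_{11}, S_{12} \mid R_1)}{\Pr_{prior}(S_{11}, S_{12})} \cdots \dfrac{\Pr(S_{n1}, S_{n2} \mid R_n)}{\Pr_{prior}(S_{n1}, S_{n2})} \, \Pr_{prior}(S_{11}, S_{12}, \ldots, S_{n1}, S_{n2})}
{\displaystyle\sum_{\text{every } S_{i_{11}}, S_{i_{12}}, \ldots, S_{i_{n1}}, S_{i_{n2}}} \dfrac{\Pr(S_{i_{11}}, S_{i_{12}} \mid R_1)}{\Pr_{prior}(S_{i_{11}}, S_{i_{12}})} \cdots \dfrac{\Pr(S_{i_{n1}}, S_{i_{n2}} \mid R_n)}{\Pr_{prior}(S_{i_{n1}}, S_{i_{n2}})} \, \Pr_{prior}(S_{i_{11}}, S_{i_{12}}, \ldots, S_{i_{n1}}, S_{i_{n2}})}
$$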
“Unseen” Alignments
SNP (A/G) found across multiple clonal reads
PCR-based sequences of diploid individuals
Calls = (CC, CT, TT)
P(CC|R) =.9995
P(CT|R) =.003
Summary:
1. We built a diploid base calling prototype from the ground up. The initial prototype’s performance is similar to Polyphred 5.
2. We are currently compiling a larger example set to improve accuracy.
3. Our method incorporates information from multiple reads for a given individual in a statistically rigorous fashion.
4. This prototype represents the first major expansion of .
5. We are currently working to expand the prototype into a production-ready application.
Probability of each possible diploid base call (AA, CC, GG, TT, AC, AG, AT, CG, CT, GT)
Each Possible Diploid Base Call/Probability
Prior Probability of Each Diploid Genotype
Depth of Alignment
Observed Diploid Variations/Probabilities
Figure axes: SVM score on the x-axis (+ is Het, - is Hom); P(Het) on the y-axis, from 0 to 1.
Utilizing multiple reads per individual, we can make an individual genotype call.
Forward Read: P(GT | Read) = .98
Reverse Read: P(GT | Read) = .87
Individual Genotype Call: P(GT) = .993
Prior(GT Frequency) = .34?
Rationale: The accuracy of the consensus diploid base call for an individual increases with the number of reads available for that individual.
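The combination of forward and reverse reads can be illustrated with a short sketch that multiplies the per-read posteriors, divides out the genotype prior once per read, and renormalizes. It assumes that every read from one individual shares the same true genotype. The spread of the remaining probability over the other nine genotypes and the function name `combine_reads` are hypothetical, so the result is not expected to reproduce the .993 shown above.

```python
GENOTYPES = ["AA", "CC", "GG", "TT", "AC", "AG", "AT", "CG", "CT", "GT"]


def combine_reads(read_posteriors, prior):
    """Combine per-read genotype posteriors P(G|R_k) into one call per individual.

    Score(G) = prior(G) * product over reads of P(G|R_k) / prior(G),
    then normalize over all genotypes.
    """
    scores = {}
    for g in prior:
        score = prior[g]
        for post in read_posteriors:
            score *= post[g] / prior[g]
        scores[g] = score
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}


def spread(genotype, p):
    """Hypothetical helper: give `genotype` probability p, split the rest evenly."""
    rest = (1 - p) / (len(GENOTYPES) - 1)
    return {g: p if g == genotype else rest for g in GENOTYPES}


prior = spread("GT", 0.34)     # prior GT frequency taken from the example
forward = spread("GT", 0.98)   # P(GT | forward read)
reverse = spread("GT", 0.87)   # P(GT | reverse read)

combined = combine_reads([forward, reverse], prior)
print(round(combined["GT"], 4))
```

Under these assumptions the combined P(GT) is higher than either single-read value, which is the behavior the rationale describes.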
Polyphred 5 was tested with the following settings: quality = 21, score = 99, source, ref_comp
Convert SVM Score to P(Het)
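One common way to map a signed SVM score onto a probability is a sigmoid, as in Platt scaling. The sketch below assumes that approach; the source does not state which conversion the prototype uses, and the parameters `a` and `b` are hypothetical placeholders that would normally be fitted on held-out examples.

```python
import math


def svm_score_to_p_het(score, a=-2.0, b=0.0):
    """Sigmoid mapping: P(Het) = 1 / (1 + exp(a * score + b)).

    With a < 0, positive scores (het side) map above 0.5 and
    negative scores (hom side) map below 0.5.
    """
    z = a * score + b
    if z >= 0:
        # Rewrite the same expression to avoid overflow for large z.
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


for s in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(s, round(svm_score_to_p_het(s), 4))
```

A score of 0 sits on the decision boundary and maps to 0.5; the steepness of the transition is controlled by `a`.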
Assessing the Genotyping Accuracy of the Initial Prototype
Objective: To enhance with an accurate diploid base calling algorithm
Figure: Accuracy. True Positive Rate (0 to 1) versus P(Het) Range (0-.1, .1-.2, .2-.3, .3-.4, .4-.5, .5-.6, .6-.7, .7-.8, .8-.9, .9-.95, .95-.99, .99-1).
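A per-range accuracy summary of this kind can be computed by binning calls on P(Het) and counting how many calls in each bin are truly heterozygous. The sketch below takes that as one plausible reading of the true positive rate; the exact definition used for the figure is not given in the source. The bin edges follow the P(Het) ranges on the axis, and the example calls are made up.

```python
BIN_EDGES = [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95, .99, 1.0]


def rate_by_p_het_range(calls, edges=BIN_EDGES):
    """calls: list of (p_het, is_true_het) pairs.

    Returns one (low, high, n_calls, fraction_true_het) row per range;
    the fraction is None when a range contains no calls.
    """
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        is_last = hi == edges[-1]
        # Ranges are [lo, hi), except the last one, which also includes hi.
        in_bin = [truth for p, truth in calls
                  if lo <= p < hi or (is_last and p == hi)]
        frac = sum(in_bin) / len(in_bin) if in_bin else None
        rows.append((lo, hi, len(in_bin), frac))
    return rows


# Made-up calls: (P(Het), whether the position is truly heterozygous).
calls = [(0.05, False), (0.05, False), (0.97, True),
         (0.97, False), (0.995, True), (1.0, True)]

for lo, hi, n, frac in rate_by_p_het_range(calls):
    print(f"{lo}-{hi}: n={n}, fraction true het={frac}")
```

The same rows also give the number of calls made within each range.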
Probability of each genotype from a diploid base call:
P(CT) = .9
P(CC) = .045
P(TT) = .045
P(Others) = .01
Figure: Sensitivity. Fraction of Callable Hets Found (0 to 1) versus P(Het) Limit (<.1, <.2, <.3, <.4, <.5, <.6, <.7, <.8, <.9, <.95, <.99, <1).
Figure: Number of Calls within Posterior Probability Range. Number of Calls Made (0 to 2000) versus P(Het) Range.
21851 Data Accuracy by P(Het) Score
Number of Alignments Analyzed: 993
Total Number of Read Positions: 231874
Total Number of Heterozygotes: 31411
Total Number of Homozygotes: 143370
Note: Polyphred was tested on alignments created by PolyBayes. This allowed Polyphred to analyze a larger fraction of reads, as compared to Phrap alignments.