26
New Strategy to detect SNPs Miguel Galves José Augusto Quitzau Zanoni Dias Scylla Bioinformatics –Brazil {miguel,jquitzau,zanoni}@scylla. com.br

New Strategy to detect SNPs

Embed Size (px)

DESCRIPTION

New Strategy to detect SNPs

Citation preview

Page 1: New Strategy to detect SNPs

New Strategy to detect SNPs

Miguel Galves

José Augusto Quitzau

Zanoni Dias

Scylla Bioinformatics –Brazil{miguel,jquitzau,zanoni}@scylla.com.br

Page 2: New Strategy to detect SNPs

Agenda

Introduction HIV Dataset Detection Strategy Trimming Procedure Base-Calling Strategies Filter Algorithm Consensus Algorithm Tests Protocol Results Discussion

Page 3: New Strategy to detect SNPs

Introduction

Polymorphism: set of base pair locus at which different alleles exists in individuals in some population– The second most frequent allele must appear in

at least 1% of the individuals SNP: polymorphism in a single base pair

position SNP discovery is very important to

understand complex diseases

Page 4: New Strategy to detect SNPs

HIV Dataset

HIV genetic sequences:– 1302 bp– Well-conserved region

35 batches from 35 individuals:– 6 PCR reads, with average size of 690bp– 1 validated sequence, with manually annotated

SNPs

HIV Reference Sequence

Page 5: New Strategy to detect SNPs

Detection Strategy: Survey

Trimming Procedure Base-Calling Correction SNPs Filter Batch Consensus Algorithm

Page 6: New Strategy to detect SNPs

Trimming Procedure

Low Quality Ends filtering Converts phred’s quality sequence to error

probability sequence: Q = -10 x log10(p)

Subtract 0.05 from all values (Q=13) Maximum Score Subsequence Algorithm

Page 7: New Strategy to detect SNPs

Base Calling: Area Ratio

The base calling is made in 5 Steps:1. Chromatogram area delimitation

2. Peak search

3. Choice of the nearest peaks

4. Calculation of the nearest peaks area

5. Calculation of the polymorphic/reference peak area

If the calculated ratio is above a certain threshold, the point is considered a polymorphism.

Page 8: New Strategy to detect SNPs

Base Calling: Area Delimitation

Page 9: New Strategy to detect SNPs

Base Calling: Peak Identification

Page 10: New Strategy to detect SNPs

Base Calling: Average Height Ratio

Almost the same steps:1. Chromatogram area delimitation2. Peak search3. Choice of the nearest peaks4. Calculation of the nearest peaks average height5. Calculation of the polymorphic/reference peak average

height.

Again, if the calculated ratio is above a certain threshold, the point is considered a polymorphism.

Page 11: New Strategy to detect SNPs

Base Calling: Peak Identification

Page 12: New Strategy to detect SNPs

Filter Algorithm

Analyzes each sequence Uses a window based algorithm to eliminate

adjacents SNPs– Window size: 11 bases– Empirical score system assigned to polymorphism

in the window

Page 13: New Strategy to detect SNPs

Consensus Algorithm

Rule-based algorithm– Empirical rules

Analyzes the whole cross section to define a consensus– Take account of nucleotide frequencies and

qualities

Do not create N symbols, nor tri-allelic polymorphisms.

Page 14: New Strategy to detect SNPs

Consensus Algorithm: Example

Sequence 1 A25 C30 C18 C30 A21

Sequence 2 A30 C25 C15 C25 A16

Sequence 3 - M18 A9 C30 -

Sequence 4 - - S12 G17 T18

Consensus A M S S W

Page 15: New Strategy to detect SNPs

Tests Protocol: Third Party Packages

Two external packages used to compare our results:– Polybayes: SNP detection tool based on Bayesian

Methods– Polyphred: SNP detection tool based on chromatogram

analysis

ACE file (contig and consensus) created for each batch using phrap

ACE file analyzed by Polyphred and Polybayes Results viewed with consed

Page 16: New Strategy to detect SNPs

Tests Protocol: Our strategy

Reads trimmed using Maximum Subsequence Algorithm

Base-calling analysis and correction using algorithms describe previously

SNP filtering Multiple alignment

– Reference sequence as anchor

Consensus creation

Page 17: New Strategy to detect SNPs

Third Party Results: Polybayes

Polybayes detected SNPs in only 2 batches out of 35

Batch Existing SNPs

Detected SNPs

Correct SNPs

False Positives

False Negatives

Batch 13 12 1 1 0 11

Batch 15 5 1 0 1 5

Page 18: New Strategy to detect SNPs

Third Party Results: Polyphred

Polyphred detected SNPs in only 4 batches out of 35

Batch Existing SNPs

Detected SNPs

Correct SNPs

False Positives

False Negatives

Batch 07 10 1 0 1 10

Batch 14 4 3 0 3 4

Batch 32 26 1 0 1 26

Batch 35 15 8 1 7 14

Page 19: New Strategy to detect SNPs

Trimming Results

Reads average size:– Before trimming: 690.15bp– After trimming: 374.74bp– Reduction of 45%

Reference sequence average base coverage– Before trimming: 2.69– After trimming: 1.77

Page 20: New Strategy to detect SNPs

Results: True Positive (%) x batch

Page 21: New Strategy to detect SNPs

Results: False Negative (%) x batch

Page 22: New Strategy to detect SNPs

Results: False Positive (%) x batch

Page 23: New Strategy to detect SNPs

Results: Summary

Polybayes Polyphred Area Avg. Height

Avg SD Avg SD Avg SD Avg SD

TP 0.3 1.4 0.2 1.1 75.4 19.2 52.6 21.5

FN 99.7 1.4 99.8 1.1 23.2 18.4 45.6 21.7

DP 0.0 0.0 0.0 0.0 1.4 4.3 1.8 4.0

FP 2.9 16.9 11.1 31.3 393.9 312.3 554.4 511.3

TP + FN + DP = 100%

Page 24: New Strategy to detect SNPs

Discussion

Polybayes and Polyphred need large sets of data to produces good results

Our algorithm produces quite satisfactory results taking into account data characteristics:

– Low average coverage– High amount of low quality bases– High amount of polymorphisms (virus DNA)

Area Ratio strategy produces better results than Average Height strategy

Page 25: New Strategy to detect SNPs

Future Work

Test the algorithms whith larger batches, whith higher average coverage, to improve consensus algorithm

Reproduce the experiments using genetic sequences of more conserved life forms, such as mammals

Page 26: New Strategy to detect SNPs

Acknowledgments