ZORRO: A Masking Program for Incorporating Alignment Accuracy in Phylogenetic Inference
Sourav Chatterji, Martin Wu



Probabilistic Masking using pair-HMMs

• Probabilistic formulation of alignment problem.

• Can answer additional questions
  – Alignment reliability
  – Sub-optimal alignments

Durbin et al., Biological Sequence Analysis, Cambridge University Press (1998)


Probabilistic Masking

• What is the probability that residues x_i and y_j are homologous?

• Posterior probability that residues x_i and y_j are homologous, given the two sequences x and y.

• Can be calculated efficiently for all pairs (and gaps) in quadratic time, i.e. O(nm) for sequences of lengths n and m.

\[
\Pr[x_i \sim y_j \mid x, y] = \frac{\Pr[x_i \sim y_j,\; x, y]}{\Pr[x, y]}
\]
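
As a rough illustration of how such posteriors can be obtained, the sketch below runs the forward and backward recursions of a three-state pair-HMM (match state M, gap states X and Y) in the style of Durbin et al. and returns Pr[x_i ~ y_j | x, y] for every residue pair. The function name, the transition parameters `delta` and `eps`, the uniform emission model (`p_same`, `k`) and the omission of an explicit end state are all assumptions made for this example; they are not ZORRO's actual model or parameter values. No scaling or log-space arithmetic is used, so it is only suitable for short sequences.

```python
def pair_hmm_match_posteriors(x, y, delta=0.1, eps=0.3, p_same=0.6, k=20):
    """Return (post, total) where post[i][j] = Pr[x_i ~ y_j | x, y] (0-based)
    and total = Pr[x, y] under a toy three-state pair-HMM (hypothetical parameters)."""
    n, m = len(x), len(y)
    q = 1.0 / k  # uniform background emission in the gap states

    def p(a, b):
        # Toy match-state emission: identical residues are favoured.
        return p_same / k if a == b else (1.0 - p_same) / (k * (k - 1))

    def table():
        return [[0.0] * (m + 2) for _ in range(n + 2)]

    # Forward pass: f*[i][j] covers prefixes x[:i], y[:j].
    fM, fX, fY = table(), table(), table()
    fM[0][0] = 1.0  # the begin state behaves like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - eps) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    total = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward pass: b*[i][j] covers suffixes x[i:], y[j:].
    bM, bX, bY = table(), table(), table()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            diag = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            gap_x = q * bX[i + 1][j] if i < n else 0.0
            gap_y = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * diag + delta * (gap_x + gap_y)
            bX[i][j] = (1 - eps) * diag + eps * gap_x
            bY[i][j] = (1 - eps) * diag + eps * gap_y

    # Posterior that x_i and y_j are emitted together by the match state.
    post = [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]
    return post, total


post, total = pair_hmm_match_posteriors("ACDEFGHIK", "ACDFGHIK")
```

Both passes fill (n+1) x (m+1) tables once, which is where the quadratic running time comes from.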


An Ideal Weighting Scheme

• Accounts for correlations between pairs
  – e.g. A-C and A-D

• Accounts for distance between the sequences in a pair
  – e.g. C-D


The Zorro Weighting Scheme

[Figure: tree with each edge labeled by its N_e value; edge labels 4, 3, 3, 3, 3]

Calculate N_e, the number of sequence pairs whose connecting path in the tree passes through edge e.


The Zorro Weighting Scheme

[Figure: tree with each edge labeled by its N_e value; edge labels 4, 3, 3, 3, 3]

• Normalize the edge weight by N_e.

• Weight of a pair is the sum of the normalized weights of the edges on the path between the two sequences.
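
The weighting steps can be sketched on a small tree as follows. This is a minimal illustration, assuming that the "edge weight" is the branch length and that the tree is given as an undirected edge list; the function name, the node labels and the branch lengths are made up for the example and are not taken from ZORRO itself.

```python
from collections import defaultdict
from itertools import combinations


def pair_weights_from_tree(edges):
    """edges: iterable of (node_u, node_v, branch_length) describing a tree.
    Returns (n_e, weights): n_e[edge] is the number of leaf pairs whose path
    uses that edge; weights[(leaf_a, leaf_b)] is the sum of length / N_e
    over the edges on the path between the two leaves."""
    adj = defaultdict(dict)
    length = {}
    for u, v, branch_length in edges:
        adj[u][v] = branch_length
        adj[v][u] = branch_length
        length[frozenset((u, v))] = branch_length
    leaves = sorted(node for node in adj if len(adj[node]) == 1)

    def path_edges(src, dst):
        # Depth-first search; in a tree the path between two nodes is unique.
        stack = [(src, None, [])]
        while stack:
            node, parent, path = stack.pop()
            if node == dst:
                return path
            for nxt in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, path + [frozenset((node, nxt))]))
        raise ValueError("nodes are not connected")

    paths = {pair: path_edges(*pair) for pair in combinations(leaves, 2)}

    # N_e: how many leaf pairs share edge e.
    n_e = defaultdict(int)
    for path in paths.values():
        for edge in path:
            n_e[edge] += 1

    # Pair weight: sum of normalized edge weights along the path.
    weights = {pair: sum(length[e] / n_e[e] for e in path)
               for pair, path in paths.items()}
    return dict(n_e), weights


# Hypothetical four-leaf tree ((A,B),(C,D)) with internal nodes "ab" and "cd".
example_edges = [("A", "ab", 0.1), ("B", "ab", 0.1),
                 ("C", "cd", 0.2), ("D", "cd", 0.2), ("ab", "cd", 0.5)]
n_e, weights = pair_weights_from_tree(example_edges)
```

For a four-leaf tree this gives N_e = 3 on each leaf edge and N_e = 4 on the internal edge. Because every edge contributes length / N_e to exactly N_e pairs, the pair weights sum to the total tree length, so a long edge shared by many pairs is not counted many times over.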


Scoring Multiple Alignment Columns

• Calculate the “posterior probability matrix” and weights w_ij for every pair of sequences.

• Weighted “sum of pairs” score for column r:

\[
\frac{\sum_{i,j} w_{ij} \, \Pr[r_i \sim r_j]}{\sum_{i,j} w_{ij}}
\]
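
A minimal sketch of the column score, assuming the pairwise posterior matrices and pair weights are already available. The function name and data layout are hypothetical, and pairs in which either sequence has a gap in the column are simply skipped here; the slides mention gap posteriors but do not spell out how they enter the score, so that choice is an assumption.

```python
def column_score(column, posteriors, weights):
    """Weighted sum-of-pairs confidence score for one alignment column.

    column[i]     : residue index of sequence i in this column, or None for a gap
    posteriors    : dict mapping (i, j) to a matrix P with P[a][b] = Pr[x_a ~ y_b]
    weights       : dict mapping (i, j) to the pair weight w_ij
    """
    numerator = 0.0
    denominator = 0.0
    for (i, j), w in weights.items():
        a, b = column[i], column[j]
        if a is None or b is None:
            continue  # assumption: gap-containing pairs are ignored
        numerator += w * posteriors[(i, j)][a][b]
        denominator += w
    return numerator / denominator if denominator > 0 else 0.0


# Toy inputs for three sequences, one residue each (values are illustrative only).
toy_weights = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 1.0}
toy_posteriors = {(0, 1): [[0.9]], (0, 2): [[0.6]], (1, 2): [[0.3]]}
score = column_score([0, 0, 0], toy_posteriors, toy_weights)
```

With these toy numbers the score is (1.0 * 0.9 + 2.0 * 0.6 + 1.0 * 0.3) / 4.0 = 0.6.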


Some Notes

• Improve running time
  – Sample a subset of pairs
  – Performance is nearly the same

• Using confidence scores
  – Cutoff-based scheme (we use 0.5)
  – Weighted sampling of columns according to confidence scores
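
The three ideas on this slide can be sketched with the Python standard library. The function names are hypothetical, and several details are assumptions not stated on the slide: the cutoff is applied as "score >= 0.5", columns are sampled with replacement, and the pairs are sampled uniformly.

```python
import random


def cutoff_mask(scores, cutoff=0.5):
    """Indices of the columns whose confidence score reaches the cutoff."""
    return [idx for idx, s in enumerate(scores) if s >= cutoff]


def weighted_column_sample(scores, n_draws=None, seed=None):
    """Draw column indices with probability proportional to their scores
    (with replacement, like a weighted bootstrap of the alignment)."""
    rng = random.Random(seed)
    n = len(scores) if n_draws is None else n_draws
    return rng.choices(range(len(scores)), weights=scores, k=n)


def sample_pairs(n_sequences, n_pairs, seed=None):
    """Uniformly sample a subset of sequence pairs to score, instead of all pairs."""
    rng = random.Random(seed)
    all_pairs = [(i, j) for i in range(n_sequences)
                 for j in range(i + 1, n_sequences)]
    return rng.sample(all_pairs, min(n_pairs, len(all_pairs)))


scores = [0.9, 0.2, 0.5, 0.0]
kept_columns = cutoff_mask(scores)          # columns 0 and 2 pass the 0.5 cutoff
resampled = weighted_column_sample(scores, seed=1)
pairs_to_score = sample_pairs(10, 15, seed=1)
```

The cutoff scheme discards low-confidence columns outright, while weighted sampling keeps every column with a nonzero score in play but makes reliable columns appear more often in the resampled alignment.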


Testing

The BAliBASE 3.0 Benchmark Database


Testing

• Realign the benchmark sequences using MSA programs such as ClustalW.

• Sensitivity: of all correctly aligned columns, the fraction that the mask labels as good.

• Specificity: of all incorrectly aligned columns, the fraction that the mask labels as bad.
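
The two measures defined above can be computed as in the sketch below, assuming each column has already been judged correct or incorrect against the reference alignment. The function name and the boolean-list representation are choices made for this example.

```python
def mask_sensitivity_specificity(correct, kept):
    """correct[c]: True if column c is correctly aligned (per the reference).
    kept[c]   : True if the mask labels column c as good.
    Returns (sensitivity, specificity); a value is NaN if its class is empty."""
    if len(correct) != len(kept):
        raise ValueError("correct and kept must have the same length")
    good_and_kept = sum(bool(c) and bool(k) for c, k in zip(correct, kept))
    bad_and_dropped = sum((not c) and (not k) for c, k in zip(correct, kept))
    n_correct = sum(bool(c) for c in correct)
    n_incorrect = len(correct) - n_correct
    sensitivity = good_and_kept / n_correct if n_correct else float("nan")
    specificity = bad_and_dropped / n_incorrect if n_incorrect else float("nan")
    return sensitivity, specificity


# Five columns: three correct (two of them kept), two incorrect (one of them dropped).
sens, spec = mask_sensitivity_specificity(
    [True, True, True, False, False],
    [True, True, False, False, True],
)
```

Here the sensitivity is 2/3 and the specificity is 1/2.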


Performance

Gblocks

ZORRO

Sensitivity Specificity

96.3% 95.1%

54.4% 94.7%


Effect on Phylogenetic Inference

• Gblocks data-set
  – Protein sequences obtained by simulating evolution on known trees
  – Diversity in data-set
    • Topology (symmetric/asymmetric)
    • Evolutionary rates
    • Alignment lengths (not tested yet)


Effect on Phylogenetic Inference

Protocol         Symmetric Tree Inference Accuracy    Asymmetric Tree Inference Accuracy
No Masking       95.17%                               91.95%
Gblocks          84.14%                               86.44%
Prob. Masking    93.56%                               93.33%

ClustalW alignments, PhyML tree


Effect on Phylogenetic Inference

Protocol         Symmetric Tree Inference Accuracy    Asymmetric Tree Inference Accuracy
                 All        High Support              All        High Support
No Masking       94.25%     69.23%                    91.95%     57.44%
Gblocks          89.2%      57.44%                    90.80%     51.88%
Prob. Masking    94.02%     68.21%                    93.79%     62.05%

MAFFT alignments, PhyML tree


Effect on Phylogenetic Inference

ClustalW alignments, PhyML tree

Protocol         Symmetric Tree Inference Accuracy    Asymmetric Tree Inference Accuracy
                 All        High Support              All        High Support
No Masking       95.17%     62.05%                    91.95%     55.38%
Gblocks          84.14%     41.03%                    86.44%     37.95%
Prob. Masking    93.56%     72.31%                    93.33%     63.59%


Effect on Phylogenetic Inference

MUSCLE alignments, PhyML tree

Protocol         Symmetric Tree Inference Accuracy    Asymmetric Tree Inference Accuracy
                 All        High Support              All        High Support
No Masking       94.71%     71.28%                    93.10%     61.03%
Gblocks          89.43%     57.95%                    90.11%     50.26%
Prob. Masking    93.56%     70.77%                    95.17%     64.62%


Conclusions/Future Work

• Technical issues
  – What if a few sequences are “bad”/non-homologous?
  – Incorporate reliability in likelihood equation and Bayesian methods.
    • With Dr. Darling in July

• Testing
  – “Real” data sets?