Structural Phylogenomic Analysis
Estimate Tree of Life; plot key traits onto tree
VirB4 model
Predict active site & subfamily specificity positions
Anti-fungal defensin(Radish)
Scorpion toxin
Extend function prediction through inclusion of structure prediction and analysis
Drosomycin(Drosophila)
Based on 12% identity to TrwB structure
Annotation transfer by homology
• Status quo approach to protein function prediction
– Given a gene (or protein) of unknown function
• Run BLAST to find homologs
• Identify the top BLAST hit(s)
• If the score is significant, transfer the annotation
– If resources permit, predict domains using PFAM or CDD
• Problems:
– Approach fails completely for ~30% of genes
– Of those with annotations, only 3% have any supporting experimental evidence
• 97% have had functions predicted by homology alone*
– High error rate
* Based on analysis of >300K proteins in the UniProt database
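The status quo procedure described above can be sketched in a few lines. This is only an illustration: the `Hit` record, the `transfer_annotation` function, the E-value cutoff of 1e-5 and the example hits are all hypothetical and are not taken from BLAST or from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One database hit for the query (hypothetical record, not a BLAST type)."""
    subject_id: str
    annotation: str
    evalue: float


def transfer_annotation(hits: List[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Copy the annotation of the top hit if it is significant, else return None.

    The cutoff is an assumed value chosen for illustration only.
    """
    if not hits:
        return None
    best = min(hits, key=lambda h: h.evalue)  # smallest E-value = top hit
    return best.annotation if best.evalue <= max_evalue else None


# Made-up hits: the top hit wins, whether or not it is an ortholog.
hits = [
    Hit("subject_1", "putative odorant receptor", 1e-30),
    Hit("subject_2", "biogenic amine GPCR", 1e-12),
]
print(transfer_annotation(hits))
```

Note that nothing in this procedure checks domain architecture or orthology, which is where the error sources listed on the following slides come in.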
Database annotation errors
Main sources of annotation errors:
1. Domain shuffling
2. Gene duplication (failure to discriminate between orthologs and paralogs)
3. Existing database annotation errors
Galperin and Koonin, “Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption.” In Silico Biol. 1998
Sub-functionalization
Neo-functionalization
Propagation of existing database annotation errors
Errors in gene structure
Berkeley Phylogenomics Group
Tomato Cf-2 Bioinformatics Analysis
Domain fusion and fission events complicate function prediction by homology, particularly for very common domains (e.g., LRR regions).
Domain structure analysis (e.g., PFAM) is often critical.
BLAST against Arabidopsis
PFAM results
Panther
Tomato Cf-2 (GI:1587673)
Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG Cell (1996)
Top BLAST hit in Arabidopsis is an RLK!
Plant and Animal Innate Immunity Mediated by Structurally Similar Receptor and Receptor-like molecules
Cytoplasmic Toll Interleukin 1 Receptor (TIR) domain
Domain fusion/fission
TM
Errors due to domain shuffling
(sic)
Error presumably due to non-orthology of database hits used for annotation
The top matching BLAST hits are putative odorant receptors
Phylogenetic analysis suggests it’s more likely a biogenic amine GPCR
Annotation error (source unknown)
Phylogenomic inference
Eisen, 1998
Sjölander, Bioinformatics 2004
Human, Chimp, Mouse, Rat, Fly, Worm
H1 C1 M1 R1 F1 W1 H2 C2 M2 R2 F2 W2
Gene duplication in ancestral organism
SCI-PHY analysis of selected GPCRs
Venter et al, The sequence of the human genome (2001) Science.
Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges,” (2004) Bioinformatics
Phylogenetic reconstruction of protein families is complicated
• Gene duplication
• Domain shuffling
• Lessening of evolutionary pressures associated with speciation and duplication enables significant structural and sequence changes
• Different mutation rates in some lineages
• Different types of constraints at some positions
• Multiple sequence alignment errors
• What members to include? (Some families contain thousands of members)
Caveats
• Sequence “signal” guides the alignment
• If the signal is weak, the alignment can be poor
• As proteins diverge from a common ancestor, their structures and functions can change
– Even structural superposition can be challenging!
• Repeats, domain shuffling, large insertions or deletions can introduce alignment errors
• If tree construction is the aim, errors in the alignment will affect tree accuracy!
Fundamental mechanisms underlying evolution of gene families
Drosomycin, Antifungal protein
Fruit Fly
Homology and adaptation among protein families
1AYJ: Antifungal protein 1 (RS-AFP1), radish
1BK8: Antimicrobial Protein 1 (Ah-Amp1), common horse chestnut
1AGT: Agitoxin 2, Egyptian scorpion (K+ channel inhibitor)
1CN2: Toxin 2, Mexican scorpion (Na+ channel inhibitor)
Protein superfamilies evolve novel forms and functions:
Homology may be hard to detect from sequence similarity alone
Pairwise alignment MSA-pw Sequence-profile methods
%ID #pair %Superpos BLAST ClustalW Tcoffee MAFFT MUSCLE HMM TreeHMM TreeHMM-Opt
>70 107 90.6 0.954 0.955 0.955 0.955 0.954 0.954 0.951 0.954 0.96
50-70 63 87.2 0.862 0.903 0.894 0.901 0.919 0.911 0.903 0.904 0.929
40-50 46 83.4 0.824 0.872 0.855 0.856 0.862 0.846 0.855 0.855 0.934
30-40 65 85.4 0.811 0.874 0.867 0.87 0.892 0.925 0.899 0.892 0.953
25-30 41 82.1 0.779 0.782 0.788 0.795 0.837 0.836 0.868 0.866 0.91
20-25 53 77.9 0.612 0.599 0.627 0.633 0.678 0.661 0.727 0.728 0.813
15-20 84 73 0.381 0.451 0.457 0.49 0.496 0.554 0.578 0.572 0.72
10-15 151 64.4 0.16 0.186 0.234 0.302 0.35 0.351 0.387 0.363 0.551
5-10 204 50.4 -0.007 -0.014 0 -0.047 0.098 0.075 0.096 0.085 0.29
0-5 122 39.5 -0.033 -0.049 -0.051 -0.034 -0.024 -0.022 -0.026 -0.025 0.127
Homology detection and alignment accuracy (and %superposable positions!) drop with evolutionary distance
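The %ID bins depend on how identity is counted, and the slides do not state the convention used. A minimal sketch, assuming identity is measured over the columns in which both aligned sequences have a residue (the function name `percent_identity` is hypothetical):

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two aligned sequences of equal length.

    Assumed convention: only columns where neither sequence has a gap ('-')
    are counted in the denominator. Other conventions exist.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    paired = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not paired:
        return 0.0
    matches = sum(x == y for x, y in paired)
    return 100.0 * matches / len(paired)


# Two toy aligned sequences: 5 gap-free columns, 4 of them identical.
print(percent_identity("MVVS--P", "MVLSSPP"))
```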
Structure can provide clues, but not necessarily exact definition
Not all positions in a molecule are created equal
Light-blue positions are variable across subfamilies – but can be very conserved within subfamilies. These are the hallmarks of binding pockets determining substrate specificity.
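One simple way to flag such positions is to look for alignment columns that are invariant inside every subfamily yet differ between subfamilies. The sketch below is an assumption-laden illustration, not the method used on the slide: the function `specificity_positions`, the strict "invariant within" criterion, and the grouping of the toy sequences into subfamilies A and B are all hypothetical.

```python
from typing import Dict, List


def specificity_positions(subfamilies: Dict[str, List[str]]) -> List[int]:
    """Return 0-based columns conserved within each subfamily but not across them.

    `subfamilies` maps a subfamily name to its aligned sequences (equal length).
    """
    length = len(next(iter(subfamilies.values()))[0])
    positions = []
    for col in range(length):
        # Set of residues each subfamily shows at this column
        per_family = [{seq[col] for seq in seqs} for seqs in subfamilies.values()]
        conserved_within = all(len(residues) == 1 for residues in per_family)
        differs_across = len(set().union(*per_family)) > 1
        if conserved_within and differs_across:
            positions.append(col)
    return positions


# Toy alignment; the split into subfamilies is invented for illustration.
toy = {
    "A": ["DSIFMK", "DSVFMK"],
    "B": ["DTIWMK", "DTIWMK"],
}
print(specificity_positions(toy))
```

Real families rarely show perfect within-subfamily conservation, so a practical version would use a softer conservation score rather than strict invariance.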
Major differences between trees are in the coarse branching order
A B C    A C B    B C A
When classes A, B and C all appear equally similar to one another, the coarse branching order can be difficult to determine. In this case, it’s critical to weight the subfamily-defining residues as more important when computing the distance between classes.
HMM construction using an initial multiple sequence alignment
Seq1 M V V S - - P
Seq2 M V V S T G P
Seq3 M V V S S G P
Seq4 M V L S S P P
Seq5 M - L S G P P
Delete/skip
Insert
Match
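How the three state types arise from an initial alignment can be sketched as follows. The rule used here (a column becomes a match state when at most half of its entries are gaps) is a common heuristic and an assumption on my part; the slides do not say which rule was used. The names `match_columns` and `state_path` are hypothetical.

```python
from typing import Dict, List

# The five-sequence alignment shown on the slide
msa = {
    "Seq1": "MVVS--P",
    "Seq2": "MVVSTGP",
    "Seq3": "MVVSSGP",
    "Seq4": "MVLSSPP",
    "Seq5": "M-LSGPP",
}


def match_columns(msa: Dict[str, str], max_gap_fraction: float = 0.5) -> List[bool]:
    """Mark each column as a match column (True) or an insert column (False)."""
    seqs = list(msa.values())
    flags = []
    for col in range(len(seqs[0])):
        gaps = sum(s[col] == "-" for s in seqs)
        flags.append(gaps / len(seqs) <= max_gap_fraction)
    return flags


def state_path(seq: str, is_match: List[bool]) -> str:
    """Spell out the M (match), D (delete/skip), I (insert) path of one sequence."""
    path = []
    for residue, match in zip(seq, is_match):
        if match:
            path.append("M" if residue != "-" else "D")
        elif residue != "-":
            path.append("I")  # residue in a non-match column
        # a gap in an insert column uses no state
    return "".join(path)


flags = match_columns(msa)
for name, seq in msa.items():
    print(name, state_path(seq, flags))
```

With the default threshold every column of this alignment is a match column, so Seq1 and Seq5 pass through delete states where they have gaps. Counting the residues emitted in each match state then gives the raw counts from which emission probabilities are estimated.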
Profile or HMM parameter estimation using small training sets
D S I F M K
D S V F M K
D T I W M K
D T I W M K
D T V W M K
What other amino acids might be seen at this position among homologs? What are their probabilities?
The context is critical when estimating amino acid distributions
D S I F M K
D S V F M K
D T I W M K
D T I W L K
D T L W L R
This position may be critical for function or structure, and may not allow substitutions
Dirichlet Mixture Prior “Blocks9”
Parameters estimated using Expectation Maximization (EM) algorithm. Training data: 86,000 columns from BLOCKS alignment database.
Combining Prior Knowledge with Observations using Dirichlet Mixture Densities
$\hat{p}_i$ = the estimated probability of amino acid ‘i’
$\vec{n} = (n_1, \ldots, n_{20})$ = the count vector summarizing the observed amino acids at a position.
$\vec{\alpha}_j = (\alpha_{j,1}, \ldots, \alpha_{j,20})$ = the parameters of component j of the Dirichlet mixture.
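The estimate these symbols feed into is, as far as I can reconstruct it from the cited Dirichlet mixture paper, the posterior mean $\hat{p}_i = \sum_j \mathrm{Prob}(\vec{\alpha}_j \mid \vec{n}, \Theta)\,\frac{n_i + \alpha_{j,i}}{|\vec{n}| + |\vec{\alpha}_j|}$, where $\Theta$ denotes the whole mixture, $q_j$ its mixture coefficients, and $|\cdot|$ the sum of a vector's entries. The sketch below implements that formula on a toy three-letter alphabet. The component parameters are invented for illustration and are not the Blocks9 values; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Component = Tuple[float, Sequence[float]]  # (mixture coefficient q_j, alpha_j)


def log_prob_counts(n: Sequence[float], alpha: Sequence[float]) -> float:
    """log Prob(n | alpha, |n|) for one Dirichlet component."""
    total_n, total_a = sum(n), sum(alpha)
    out = math.lgamma(total_n + 1) + math.lgamma(total_a) - math.lgamma(total_n + total_a)
    for n_i, a_i in zip(n, alpha):
        out += math.lgamma(n_i + a_i) - math.lgamma(n_i + 1) - math.lgamma(a_i)
    return out


def posterior_mean(n: Sequence[float], mixture: List[Component]) -> List[float]:
    """Posterior mean estimate of the residue probabilities given counts n."""
    # Posterior weight of each component: q_j * Prob(n | alpha_j), normalized
    logs = [math.log(q) + log_prob_counts(n, alpha) for q, alpha in mixture]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in logs]
    norm = sum(weights)
    total_n = sum(n)
    p = [0.0] * len(n)
    for weight, (_, alpha) in zip(weights, mixture):
        total_a = sum(alpha)
        for i, (n_i, a_i) in enumerate(zip(n, alpha)):
            p[i] += (weight / norm) * (n_i + a_i) / (total_n + total_a)
    return p


# Toy mixture over a 3-letter alphabet (invented parameters)
toy_mixture = [(0.5, [2.0, 1.0, 1.0]), (0.5, [1.0, 1.0, 2.0])]
print(posterior_mean([5, 0, 0], toy_mixture))
```

With few observations the prior dominates the estimate; as the counts grow, the estimate approaches the observed frequencies, which is the behaviour wanted for small training sets.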
Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology. Sjolander, Karplus, Brown, Hughey, Krogh, Mian and Haussler. CABIOS (1996)
SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11
Xia Jiang, Nandini Krishnamurthy, Duncan Brown, Michael Tung, Jake Gunn-Glanville, Bob Edgar
SATCHMO motivation
• Structural divergence within a superfamily means that…
– Multiple sequence alignment (MSA) is hard
– Alignable positions vary according to degree of divergence
• Current MSA methods are not designed to handle this variability
• Assume globally alignable, all columns (e.g. ClustalW)…
– Over-aligns, i.e. aligns regions that are not superposable
• …or identify and align only highly conserved positions (profile HMMs)
– Discards information important for subfamily specificity
• Reality
– Different degrees of alignability in different sequence pairs, different regions
Agglomerative clustering
Algorithm:
Initialize all objects to be separate classes (leaves in the tree).
Join “closest” classes (connecting each by edges to a node).
Compute distance between new class and other classes.
Join closest two classes.
Iterate until all classes are joined into one class (a tree)
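The steps above can be sketched as follows. The slide does not say how the distance between a new class and the others is computed; this sketch assumes average linkage over leaf-to-leaf distances, and the distance values and function names are made up for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def average_linkage(members_a: List[str], members_b: List[str],
                    dist: Dict[FrozenSet[str], float]) -> float:
    """Mean leaf-to-leaf distance between two classes (assumed linkage rule)."""
    pairs = [(a, b) for a in members_a for b in members_b]
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


def agglomerate(leaves: List[str], dist: Dict[FrozenSet[str], float]):
    """Join the closest classes until one tree remains; returns nested tuples."""
    # Initialize: every object is its own class, stored as (subtree, member leaves)
    clusters = [(leaf, [leaf]) for leaf in leaves]
    while len(clusters) > 1:
        # Find the closest pair of classes
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]][1], clusters[ij[1]][1], dist),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)  # new internal node
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Invented distances between three genes
toy_dist = {
    frozenset(("H1", "M1")): 1.0,
    frozenset(("H1", "F1")): 4.0,
    frozenset(("M1", "F1")): 4.0,
}
print(agglomerate(["H1", "M1", "F1"], toy_dist))
```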
SATCHMO output
1. Tree
• Cluster based on structural “distance”
• Built simultaneously with alignments
2. Multiple sequence alignments
• Different alignment for each cluster (=each node in tree)
3. Prediction of alignable / non-alignable regions
• 1,2,3 mutually dependent, inform each other
– Interact each time two clusters are combined
Note: we can assess alignment quality, but tree topology accuracy is not straightforward to assess.
SATCHMO algorithm: Progressive profile-profile alignment
• Typical state: set of subtrees
– Cluster (=subtree) contains
• alignment of all subtree sequences
• profile HMM
– Initialization: each sequence forms a leaf in tree
• Iterated step
– Find most closely related pair of subtrees (using HMM scoring)
– Align the MSAs of the two clusters using profile-profile alignment…
– …treats MSA column as single “letter”, keeps columns intact
– Result: new cluster with its own MSA
– Predict “alignable” columns, and build profile HMM (w/Dirichlet mixture densities).
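The loop above can be illustrated with a much-simplified sketch. It is not SATCHMO: HMM scoring is replaced by a plain profile dot-product score, the gap penalty is an arbitrary assumed value, and the alignable-column prediction and Dirichlet mixture steps are omitted. What it does show is the central idea of aligning two MSAs column by column while keeping each column intact. All names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

GAP = -0.5  # assumed penalty for aligning a column against nothing


def column_profile(msa: List[str], col: int) -> Dict[str, float]:
    """Residue frequencies in one column, ignoring gaps."""
    residues = [s[col] for s in msa if s[col] != "-"]
    return {a: residues.count(a) / len(residues) for a in set(residues)} if residues else {}


def column_score(p: Dict[str, float], q: Dict[str, float]) -> float:
    """+1 if both columns hold the same single residue, -1 if they share none."""
    return 2.0 * sum(p[a] * q.get(a, 0.0) for a in p) - 1.0


def align_profiles(msa_a: List[str], msa_b: List[str]) -> Tuple[float, List[str]]:
    """Global alignment of two MSAs, treating each column as a single letter."""
    la, lb = len(msa_a[0]), len(msa_b[0])
    pa = [column_profile(msa_a, c) for c in range(la)]
    pb = [column_profile(msa_b, c) for c in range(lb)]
    S = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        S[i][0] = S[i - 1][0] + GAP
    for j in range(1, lb + 1):
        S[0][j] = S[0][j - 1] + GAP
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            S[i][j] = max(S[i - 1][j - 1] + column_score(pa[i - 1], pb[j - 1]),
                          S[i - 1][j] + GAP, S[i][j - 1] + GAP)
    # Traceback: record which original column (or None) fills each merged column
    cols_a, cols_b, i, j = [], [], la, lb
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + column_score(pa[i - 1], pb[j - 1]):
            cols_a.append(i - 1); cols_b.append(j - 1); i -= 1; j -= 1
        elif i > 0 and (j == 0 or S[i][j] == S[i - 1][j] + GAP):
            cols_a.append(i - 1); cols_b.append(None); i -= 1
        else:
            cols_a.append(None); cols_b.append(j - 1); j -= 1
    cols_a.reverse(); cols_b.reverse()

    def expand(msa: List[str], cols: List) -> List[str]:
        return ["".join("-" if c is None else s[c] for c in cols) for s in msa]

    return S[la][lb], expand(msa_a, cols_a) + expand(msa_b, cols_b)


def progressive_align(seqs: List[str]) -> List[str]:
    """Repeatedly merge the best-scoring pair of clusters into one MSA."""
    clusters = [[s] for s in seqs]  # initialization: each sequence is a leaf
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: align_profiles(clusters[ij[0]], clusters[ij[1]])[0])
        _, merged = align_profiles(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


for row in progressive_align(["MVVSP", "MVVSTGP", "MVVSSGP", "MVLSSPP", "MLSGPP"]):
    print(row)
```

Because each merge returns a new MSA for the joined cluster, recording the merge order would give the tree and the per-node alignments together, which is the coupling the slide describes.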
Assessing sequence alignment with respect to structural alignment
Xia Jiang Duncan Brown Nandini Krishnamurthy
Alignment accuracy as a function of % ID (including homologs, full-length sequences)
[Figure: average CS score (y-axis, 0 to 1) versus percent ID (x-axis, bins from 10-15% to 35-40%) for CLUSTALW, MUSCLE, MAFFT and SATCHMO]
Alignment of proteins with different overall folds
Summary
• SATCHMO is designed to provide for the assumption of ‘positional homology’ during the tree estimation process
• This assumption -- that we can predict the structurally equivalent positions from sequence information alone -- needs to be tested
• We need a benchmark dataset to evaluate phylogenetic tree topology estimation