Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs
Brian Y. Chen, Viacheslav Y. Fofanov, David M. Kristensen, Marek Kimmel, Olivier Lichtarge, Lydia E. Kavraki
Motivation
• Understanding the function of proteins is a fundamental purpose of biology
• Experimental determination of protein function is expensive and time consuming
• Algorithms for computational function prediction could guide and accelerate the protein function discovery process
A Computational Approach
• Comparative Analysis
– Focus: Algorithms for Comparative Analysis
• What is similar about proteins with similar function?
– Sequence – same components?
– Geometry – same structure?
– Dynamics – same motion?
(Same Chemistry)
What do we need?
• A motif for comparison
– Representative of Biological function
• An algorithm for comparison
– Search for Geometric and Chemical similarity
• Statistical analysis
– Classification of results
Outline
• Evolutionary Trace (ET)
– A source of biologically relevant motifs using evolutionary data
• Match Augmentation (MA)
– An algorithm for identifying geometric similarity
• Statistical analysis
– Statistically determined geometric thresholds
The Evolutionary Trace (ET)
Lichtarge et al, JMB 1996; Lichtarge et al, JMB 1997; Lichtarge et al, PNAS 1996; Sowa et al, NSB 2001
[Figure: a multiple sequence alignment of nine 7-residue sequences and its evolutionary tree are combined with the protein structure to locate the functional site; the example shows the Evolutionary Trace at rank 4]
position  1 2 3 4 5 6 7
consensus X - - X - C X
rank      2 - - 4 - 1 3
ET Clusters Functionally Relevant
http://imgen.bcm.tmc.edu/molgenlabs/lichtarge/trace_of_the_week/traces.html
[Figure: ligand binding site compared with ET clusters for three proteins: Trp1 domain of Hop, Dihydropteroate Synthase, Galectin CRD]
Structural Epitope: Yellow = ligand, Blue = residues within 5Å of the ligand
ET Clusters: Yellow = ligand, Red = largest cluster, Other colors = trace residues
Geometric Motifs
• Trace Clusters are functionally relevant
• A source for geometric motifs
• Geometric Motifs → Function
– Given a protein structure with:
• Same Amino Acids
• Same Geometry and Chemistry
– Does the protein have the same function?
Geometric Comparison Algorithms
• Geometric Hashing
– Wolfson H.J. et al. IEEE Comp. Sci. Eng., 4(4):10–21, 1997.
• JESS
– Barker J.A. et al. Bioinformatics, 19(13):1644-9, 2003.
• PINTS
– Stark A. et al. Journal of Molecular Biology, 326:1307-16, 2003.
• Many Others
– webFEATURE, DALI, CE, SSAP…
• Our method: Match Augmentation
– Integrate Structural and Evolutionary data
– Efficient application of hashing and depth first search
PSB 2005
Geometric Comparison Strategy
• Biological Input:
– A structure of a functional site (Motif)
– A protein structure with unknown function (Target)
• Geometric Search:
– Find target atoms geometrically similar to motif atoms, with similar atoms and amino acids (Match)
• Output:
– Match of atoms with greatest geometric similarity
• May identify a similar functional site in the target
Motifs: Structure & Evolution Data
• Structure of a Functional Site
– Points in three dimensions (3D) taken from atom coordinates (motif points)
– Labeled by residue and atom identity
– Alternate residues from mutation
• Support for complex active sites
– Priority-ranked motif points
• Functional relevance
[Figure: four motif points ranked 1 to 4; one point carries the example label {G,C,T} C]
Input Data: Targets
• Targets
– Points in 3D taken from atom coordinates of whole protein structures (target points)
– Labeled by residue and atom identity
– No Alternate residues
– No ranking
[Figure: a target point with the example label {Y} C]
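The motif and target inputs described on the two slides above can be captured in two small record types. This is a minimal Python sketch, not the authors' implementation: the names `MotifPoint`, `TargetPoint`, `xyz`, `residues`, `rank`, and `by_priority` are hypothetical, the coordinates are made up, and treating rank 1 as the highest priority is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Coord = Tuple[float, float, float]


@dataclass(frozen=True)
class TargetPoint:
    xyz: Coord      # atom coordinates taken from the whole protein structure
    residue: str    # a single residue label, e.g. "Y" (no alternates)
    atom: str       # atom identity, e.g. "C"


@dataclass(frozen=True)
class MotifPoint:
    xyz: Coord
    residues: FrozenSet[str]  # residue label plus alternate residues
    atom: str
    rank: int                 # assumed convention: 1 = highest priority


def by_priority(motif: List[MotifPoint]) -> List[MotifPoint]:
    """Return the motif points ordered from highest to lowest priority."""
    return sorted(motif, key=lambda p: p.rank)


# Made-up example data mirroring the labels shown on the slides.
motif = [
    MotifPoint((4.0, 1.0, 0.0), frozenset({"S"}), "C", rank=2),
    MotifPoint((0.0, 0.0, 0.0), frozenset({"G", "C", "T"}), "C", rank=1),
]
target = [TargetPoint((9.5, 2.0, 1.0), "Y", "C")]
ordered = by_priority(motif)
```

Keeping the alternate residues as a set on the motif side only reflects the asymmetry stated on the slides: targets carry exactly one residue label and no ranking.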
Search: Matching Criteria
• Geometric Similarity
– Matched points are within a threshold distance of each other when optimally superimposed
• Label Compatibility
– Target residue label is a member of the motif point's Alternate Residues
– Atom labels identical
[Figure: a motif point labeled {S,L,T} C and a target point labeled {S} C, separated by less than the threshold distance]
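The two matching criteria can be written as two small predicates. The sketch below is illustrative only: the function names and the `threshold` parameter are hypothetical, and the geometric check assumes the two points have already been superimposed by some other step.

```python
import math


def labels_compatible(motif_residues, motif_atom, target_residue, target_atom):
    """Label criterion: identical atom labels, and the target residue
    is one of the residues allowed at the motif point."""
    return target_atom == motif_atom and target_residue in motif_residues


def within_threshold(motif_xyz, target_xyz, threshold):
    """Geometric criterion for one correlated pair of points,
    assuming both are expressed in the same, superimposed frame."""
    return math.dist(motif_xyz, target_xyz) < threshold


# Example labels taken from the slide figure: {S,L,T} C versus {S} C.
ok = labels_compatible({"S", "L", "T"}, "C", "S", "C")
bad = labels_compatible({"S", "L", "T"}, "C", "Y", "C")
close = within_threshold((0.0, 0.0, 0.0), (0.3, 0.4, 0.0), threshold=1.0)
```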
Matches
• Matches correlate motif points to target points
– Bijection
– Fulfill Geometric and Label Criteria
• Geometric Similarity measured by Least Root Mean Squared Distance (LRMSD)
• The match we seek:
– Bijection of all motif points
– Smallest LRMSD of all matches considered
[Figure: a Motif, a Target, and the resulting Match]
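LRMSD is the root mean squared distance between correlated points after the rigid superposition that minimizes it. The slides do not say how it is computed, so the following pure-Python sketch picks one standard approach as an assumption: Horn's quaternion formulation, in which the minimal squared error follows from the largest eigenvalue of a 4x4 symmetric matrix (found here with Jacobi rotations). The function names are hypothetical.

```python
import math


def _largest_eigenvalue(a, sweeps=50):
    """Largest eigenvalue of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(a)
    a = [row[:] for row in a]
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off < 1e-30:
            break
        for p in range(n):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return max(a[i][i] for i in range(n))


def lrmsd(motif_xyz, target_xyz):
    """RMSD of correlated 3D points after optimal rotation and translation."""
    n = len(motif_xyz)
    cm = [sum(p[i] for p in motif_xyz) / n for i in range(3)]
    ct = [sum(q[i] for q in target_xyz) / n for i in range(3)]
    P = [[p[i] - cm[i] for i in range(3)] for p in motif_xyz]   # centered motif
    Q = [[q[i] - ct[i] for i in range(3)] for q in target_xyz]  # centered target
    S = [[sum(p[i] * q[j] for p, q in zip(P, Q)) for j in range(3)] for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    N = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    norm = sum(x * x for p in P for x in p) + sum(x * x for q in Q for x in q)
    err = max(0.0, norm - 2.0 * _largest_eigenvalue(N))  # minimal summed squared error
    return math.sqrt(err / n)


# Made-up example: the target is the motif rotated 90 degrees about z and shifted.
motif_xyz = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
target_xyz = [(5, -1, 2), (5, 0, 2), (3, -1, 2), (5, -1, 5)]
example = lrmsd(motif_xyz, target_xyz)  # close to zero for a rigid copy
```

The two point lists must be in corresponding order, which is what the bijection of a match provides.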
Match Augmentation at a Glance
Input → Seed Matching → Augmentation → Output
• Design Principle:
– Correlate high ranking points first
– Exhaustively test potential matches
– Filter for the match with lowest LRMSD
• Two Phases:
– Seed Matching
– Augmentation
[Figure sequence: the same pipeline illustrated step by step on motif points ranked 1 to 5]
• Seed Matching: Match High Ranked Points
• Augmentation: Expand matches to rest of Motif
Filtering Completed Matches
• Augmentation implements a depth first search:
– A match is complete when there are no more points to match (example in figure: LRMSD 2.41)
• Completed matches are stored in a heap, sorted by LRMSD
• Final output: match with smallest LRMSD (example in figure: LRMSD 0.87)
Match Augmentation Summary
• Hybrid Algorithm
– Seed Matching: Hashing
– Augmentation: Depth First Search
• Finds matches to motifs within target structures
– Final output: match with smallest LRMSD
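The two-phase structure summarized above can be sketched in a few dozen lines of Python. This is a simplified illustration under stated assumptions, not the published algorithm: seed matching is done by brute-force enumeration instead of hashing, and partial matches are scored with a distance RMSD (the RMS difference of corresponding pairwise distances) as a stand-in for LRMSD. All names, the `seed_size` and `eps` parameters, and the data are hypothetical.

```python
import heapq
import math
from itertools import combinations


def compatible(m, t):
    """Label criterion: same atom, target residue among the motif point's residues."""
    return t["atom"] == m["atom"] and t["residue"] in m["residues"]


def drmsd(motif, target, match):
    """Stand-in score: RMS difference of corresponding pairwise distances."""
    sq = [
        (math.dist(motif[i]["xyz"], motif[j]["xyz"])
         - math.dist(target[a]["xyz"], target[b]["xyz"])) ** 2
        for (i, a), (j, b) in combinations(match, 2)
    ]
    return math.sqrt(sum(sq) / len(sq)) if sq else 0.0


def extensions(motif, target, match, i, eps):
    """All ways to add motif point i to a partial match within tolerance eps."""
    used = {a for _, a in match}
    for a, t in enumerate(target):
        if a not in used and compatible(motif[i], t):
            candidate = match + [(i, a)]
            if drmsd(motif, target, candidate) <= eps:
                yield candidate


def match_augmentation(motif, target, seed_size=3, eps=1.0):
    order = sorted(range(len(motif)), key=lambda i: motif[i]["rank"])
    # Phase 1, seed matching: correlate the highest ranked points first.
    seeds = [[]]
    for i in order[:seed_size]:
        seeds = [c for s in seeds for c in extensions(motif, target, s, i, eps)]
    # Phase 2, augmentation: depth first search over the remaining points.
    completed = []  # heap of (score, match), smallest score on top
    stack = list(seeds)
    while stack:
        match = stack.pop()
        if len(match) == len(order):  # no more points to match
            heapq.heappush(completed, (drmsd(motif, target, match), match))
            continue
        stack.extend(extensions(motif, target, match, order[len(match)], eps))
    return completed[0] if completed else None


motif = [
    {"xyz": (0, 0, 0), "atom": "CA", "residues": {"S"}, "rank": 1},
    {"xyz": (3, 0, 0), "atom": "CA", "residues": {"H"}, "rank": 2},
    {"xyz": (0, 4, 0), "atom": "CA", "residues": {"D", "E"}, "rank": 3},
    {"xyz": (0, 0, 5), "atom": "CA", "residues": {"G"}, "rank": 4},
]
target = [
    {"xyz": (50, 50, 50), "atom": "CA", "residue": "S"},  # decoy
    {"xyz": (10, 10, 10), "atom": "CA", "residue": "S"},
    {"xyz": (13, 10, 10), "atom": "CA", "residue": "H"},
    {"xyz": (10, 14, 10), "atom": "CA", "residue": "E"},
    {"xyz": (10, 10, 15), "atom": "CA", "residue": "G"},
    {"xyz": (10, 10, 30), "atom": "CA", "residue": "G"},  # decoy
]
best = match_augmentation(motif, target)  # (score, [(motif index, target index), ...])
```

Pruning partial matches against `eps` is a heuristic in this sketch; it keeps the search small but is a design choice of the example rather than something the slides specify.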
Testing MA on Biological data
• Data Set
– 12 motifs selected from residues surrounding enzymatic active sites
– 73 targets, each evolutionarily related to one of the motifs
– Details: www.cs.rice.edu/~brianyc/papers/PSB2005/
• Experimental Protocol
– Search for each motif within every target.
• Matches of evolutionarily related motif-target pairs are “HPs” (BLUE)
• Matches of unrelated motif-target pairs are “NHPs” (RED)
Match Augmentation Conclusions
• Match Augmentation is accurate
– Identifies cognate active sites in 95.4% of evolutionarily related proteins
• Match Augmentation is very efficient
– Matches can be found in a fraction of a second
Evaluating Statistical Significance
• Hypothesis Testing Framework:
– H0: Motif and Target are functionally unrelated
– HA: Motif and Target are functionally related
• Reject H0 for a given match only if the match is unusual under H0.
• Problem: how do we evaluate the H0 for a given match?
The “Usual” H0 distribution
• The set of matches between the motif and all functionally unrelated targets
• Previous methods approximate this distribution:
– JESS
• Matches are compared to a reference population of motifs that is partially ordered by degree of occurrence
– PINTS
• Approximates the distribution of matches with an artificial curve parameterized by motif size and residue content
• MA can calculate this distribution explicitly by computing matches to the entire PDB
• JESS: Barker J.A. et al. Bioinformatics, 19(13):1644-9, 2003.
• PINTS: Stark A. et al. Journal of Molecular Biology, 326:1307-16, 2003.
A Distribution of Match LRMSDs
• LRMSD distribution of matches with entire PDB
– Almost all known protein structures
– Almost no functional relation to our motifs
• Reasonable H0 Distribution
[Figure: two histograms of match LRMSD, Unsmoothed and Smoothed; x-axis: LRMSD, y-axis: Frequency]
How unusual is our match?
• We want: the probability of observing a match with lower LRMSD than the given match
[Figure: estimated density of match LRMSDs; x-axis: LRMSD, y-axis: Frequency; a vertical line marks the Match LRMSD]
• A: Area left of line
• B: Area under curve
• p-Value: $p = \frac{A}{B} = \frac{\text{matches with lower LRMSD}}{\text{matches total}}$
• Apply p-value to reject H0
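The ratio of matches with lower LRMSD to total matches can be computed directly from a list of LRMSDs observed under H0. The sketch below is illustrative: the function name is hypothetical and the LRMSD values are made up.

```python
def empirical_p_value(match_lrmsd, null_lrmsds):
    """Fraction of H0 matches whose LRMSD is lower than the given match."""
    lower = sum(1 for x in null_lrmsds if x < match_lrmsd)
    return lower / len(null_lrmsds)


# Made-up LRMSDs of matches against functionally unrelated targets.
null_lrmsds = [0.9, 1.4, 1.8, 2.1, 2.2, 2.6, 2.9, 3.3, 3.7, 4.0]
p = empirical_p_value(1.0, null_lrmsds)  # one of ten null matches is lower
```

A small p means few unrelated targets match the motif more closely, which is the evidence used to reject H0.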
Statistical Significance
• Result: Data driven statistical significance value (p-value)
– No dependence on approximations like previous work
• p-value of a match tells us the probability of observing another match with lower LRMSD, with a functionally unrelated target
• Apply p-value to reject H0
• Do matches identifying cognate active sites (HPs) have low p-values? (i.e. Can we reject H0 for HPs?)
Testing our Statistical Analysis
• Distributions of matches over the PDB can be calculated efficiently
– 12:48 on a single machine, on average
• Do not have to scan the entire PDB to accurately determine the H0 distribution
– 5% random sample accurate enough
– Reduces sample time to 0:38, on average
• Matches of cognate active sites (HPs) are statistically significant (low p-values)
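The effect of sampling can be illustrated by estimating the same p-value from a full H0 population and from a 5% random sample of it. Everything in this sketch is a stand-in: the population is synthetic (drawn from a Gaussian chosen only for illustration, not from PDB data), and the names and the match LRMSD of 1.5 are hypothetical.

```python
import random


def empirical_p_value(match_lrmsd, null_lrmsds):
    """Fraction of H0 matches whose LRMSD is lower than the given match."""
    lower = sum(1 for x in null_lrmsds if x < match_lrmsd)
    return lower / len(null_lrmsds)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Synthetic stand-in for the LRMSDs of matches against every PDB structure.
full_null = [max(0.0, rng.gauss(2.5, 0.6)) for _ in range(20000)]
sample = rng.sample(full_null, k=len(full_null) // 20)  # 5% random sample

p_full = empirical_p_value(1.5, full_null)
p_sample = empirical_p_value(1.5, sample)
```

Comparing `p_full` with `p_sample` shows how closely the sampled estimate tracks the full one for this synthetic population; how well that carries over to real match distributions is what the slide's result reports.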
Conclusions
• Match Augmentation is accurate and extremely efficient
– Correctly identifies cognate active sites (HPs)
– Identifies matches in fractions of a second
• Algorithmic efficiency enables detailed Statistical Analysis
– Explicitly calculate H0 distribution without dependence on approximated H0 distributions
– Matches of cognate active sites (HPs) are statistically significant
– Significance threshold translates into useful motif-specific LRMSD thresholds
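One way to read the last point: for a chosen significance level, the H0 distribution of a motif yields the largest LRMSD that is still significant. The sketch below assumes the p-value is the fraction of H0 matches with strictly lower LRMSD; the function name, the `alpha` parameter, and the values are hypothetical.

```python
import math


def lrmsd_threshold(null_lrmsds, alpha):
    """Largest LRMSD whose empirical p-value is still at most alpha."""
    ordered = sorted(null_lrmsds)
    k = math.floor(alpha * len(ordered))  # number of lower H0 matches allowed
    return ordered[k] if k < len(ordered) else math.inf


# Made-up H0 LRMSDs for one motif: 0.1, 0.2, ..., 10.0.
null_lrmsds = [i / 10 for i in range(1, 101)]
threshold = lrmsd_threshold(null_lrmsds, alpha=0.05)
```

Because each motif has its own H0 distribution, the same `alpha` gives a different LRMSD cutoff per motif, which is what makes the threshold motif-specific.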
Special Thanks
• Kavraki Group
– David Schwarz
– Amarda Shehu
– Allison Heath
– Hernan Stamati
– Anne Christian
– Drew Bryant
– Amanda Cruess
– Brad Dodson
– Jessica Wu
• Lichtarge Lab
– David Kristensen
– Dan Morgan
– Ivana Mihalek
– Hui Yao
• Kimmel Group
– Viacheslav Fofanov
• Funding
– NSF
– NLM 5T15LM07093
– March of Dimes
– Whitaker Foundation
– Sloan Foundation
– VIGRE
– AMD