Functional Annotation of Proteins with Known Structure by Structure and Sequence Similarity, DNA-protein Interaction Patterns and GO Framework

Ilya Shindyalov, PhD, UCSD/SDSC
Group Leader, Protein Science Research
DIMACS 2005-06-13

Essential Dataflow in Protein Science

Data: Protein Sequence, Structure, Function
Methods: Sequence similarity: (i) BLAST, (ii) fold recognition, (iii) homology modeling
Results: Structure similarity: (i) DALI, (ii) VAST, (iii) CE

COVERAGE RATIO FOR FUNCTIONAL ANNOTATION

                   Disease   Biological Process   Cell Component   Molecular Function
PDB STRUCTURES     0.758     0.396                0.371            0.335
SG TARGETS         0.355     0.315                0.452            0.259
PDB+SG             0.822     0.528                0.593            0.477
HOMOLOGY MODELS    0.984     0.792                0.839            0.821

Do we know the function, if we know the structure?

The Subjects of my Talk

3 Approaches of Using Structure Similarity to Infer Protein Function:
#1: Assigning function from known to unknown. CASE STUDY: Prediction of calcium binding in Acetylcholine Esterase. Projection on SNP responsible for Autism.
#2: Classification of DNA-binding protein domains involving (in addition to structure similarity) DNA-protein interaction patterns and sequence similarity.
#3: Extending GO annotation using structure similarity: how reliable can it be?
#4 [BONUS]: Why is ontology so important for humans?


CE: Protein structure comparison by Combinatorial Extension of the optimal path (Shindyalov and Bourne, 1998). http://cl.sdsc.edu

CE Step 1. Heuristic search for initial path.

CE Step 2. Iterative dynamic programming on starting superposition from step 1.

CE vs. other Algorithms ???
Novotny, M., Madsen, D., and Kleywegt, G.J. 2004. Evaluation of protein fold comparison servers. Proteins 54: 260-270.

2ACE vs. 1TN4 (Acetylcholinesterase vs. Troponin C):
RMSD = 4.6 Å, Z-score = 4.6, LALI = 86, LGAP = 8, Seq. Identity = 3.5%

Data and algorithms used:

PDB - Protein Data Bank of February 13, 2002 with 17,304 entries was used as the source of original structural data.
- The DNA fragment size is at least 5 bp long.
- At least 5 different protein residues are involved in the interaction with DNA.
- The contact distance cutoff between interacting atoms was < 5 Å.
- We did not take into account the different types of DNA (A, B, Z) because of the insufficient level of this annotation in the PDB.
PDP - Protein Domain Parser (Alexandrov, Shindyalov, Bioinformatics, submitted)
CE - Protein structure alignment by Combinatorial Extension (Shindyalov, Bourne, 1998)
SCOP - Structural Classification of Proteins (Murzin et al., 1995)
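The selection criteria above can be illustrated with a minimal sketch. It assumes the distance cutoff is in ångströms and that atoms are already available as plain coordinate tuples; the function names and the input layout are hypothetical and are not taken from the talk.

```python
from math import dist

CONTACT_CUTOFF = 5.0  # assumed to be in angstroms ("< 5" in the criteria)
MIN_RESIDUES = 5      # at least 5 different protein residues contact DNA
MIN_DNA_BP = 5        # DNA fragment is at least 5 bp long


def contacting_residues(protein_atoms, dna_atoms, cutoff=CONTACT_CUTOFF):
    """Return ids of protein residues having an atom closer than cutoff to DNA.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples (hypothetical layout)
    dna_atoms: iterable of (x, y, z) tuples
    """
    dna_atoms = list(dna_atoms)
    residues = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in residues:
            continue  # residue already known to be in contact
        if any(dist(xyz, d) < cutoff for d in dna_atoms):
            residues.add(residue_id)
    return residues


def is_dna_binding(protein_atoms, dna_atoms, dna_length_bp):
    """Apply the fragment-length and residue-count criteria to one chain."""
    if dna_length_bp < MIN_DNA_BP:
        return False
    return len(contacting_residues(protein_atoms, dna_atoms)) >= MIN_RESIDUES
```

The all-pairs distance loop is quadratic; a real pipeline over the whole PDB would probably use a spatial index, which is omitted here for brevity.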

Building representative set of domains:

Starting from the PDB:
1. Selection of DNA-binding protein chains by analyzing DNA-protein contacts
2. Parsing of DNA-binding protein chains into domains using PDP
3. Selection of DNA-binding protein domains by analyzing DNA-protein contacts
4. All-against-all structural alignment of DNA-binding protein domains using CE
5. Selection of representative (non-redundant) set of DNA-binding protein domains
6. Calculating classification of DNA-binding protein domains

Parameters measuring structural similarity:
- Rmsd, root mean squared deviation between two aligned and compared protein domains: > 2.0 Å;
- Z-score, statistical score obtained from CE: < 4.5;
- Rnar, ratio of the number of aligned residues to the smallest domain length: < 90%;
Note: sequence identity in the alignment < 90%.

Figure 2. Determination of the matched protein-DNA contact pattern for two hypothetical DNA-protein domain complexes A and B structurally aligned to each other. All residues except those matched to "-" are considered aligned to each other. Stars denote residues involved in protein-DNA interactions. Vertical bars denote matched protein residues involved in interaction with DNA.

(1) Parameters measuring structural similarity: Rmsd, Z-score, Rnar;
(2) Parameter measuring the match between DNA-protein contact patterns: Rmat.

$\mathrm{Rmat} = \min\{\mathrm{Rmat}_A, \mathrm{Rmat}_B\}$, where A and B are DNA-protein domain complexes and $\mathrm{Rmat}_X$ is the ratio of the number of matched residues to the total number of residues involved in contacts with DNA in the DNA-protein complex X.
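A small sketch of the Rmat calculation may make the definition concrete. It assumes the alignment is given as residue pairs with `None` standing for the gap symbol, and that the DNA-contacting residues of each complex are already known; the function name and data layout are hypothetical.

```python
def rmat(alignment, contacts_a, contacts_b):
    """Match ratio between the DNA-contact patterns of complexes A and B.

    alignment: list of (residue_a, residue_b) pairs; None marks a gap ("-")
    contacts_a, contacts_b: sets of residues contacting DNA in A and in B
    """
    if not contacts_a or not contacts_b:
        return 0.0  # assumption: no contacts means no match
    # A pair is "matched" when both aligned residues contact DNA.
    matched = sum(
        1
        for a, b in alignment
        if a is not None and b is not None
        and a in contacts_a and b in contacts_b
    )
    rmat_a = matched / len(contacts_a)
    rmat_b = matched / len(contacts_b)
    return min(rmat_a, rmat_b)
```

Taking the minimum means a high score requires that most contacting residues of both complexes are matched, not just those of the smaller contact set.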

Realignment using a scoring function that takes into account the structural similarity between two protein domains and the protein-DNA contact pattern.
Structure similarity term:
Protein-DNA contact pattern term:
where m denotes a protein residue, X a protein-DNA complex; C3 is a scaling constant.

If Rmsd > 5.0 Å or Rnar < 70% or Z-score < 3.5, then domains are not considered similar;
If Rmsd ≤ 3.0 Å and Rnar ≥ 80%, then domains are considered similar;
If Rmat ≥ Rmat threshold and either 3.0 Å < Rmsd ≤ 5.0 Å and Rnar ≥ 70%, or 70% ≤ Rnar < 80% and Rmsd ≤ 5.0 Å, then domains are considered similar.
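The three rules can be sketched as a single decision function. This assumes the comparison operators missing from the extracted text are "≤" and "≥" (the reading that makes the three rules fit together without gaps), and it takes the Rmat threshold as a parameter because its value is not stated here. The function name is hypothetical.

```python
def domains_similar(rmsd, z_score, rnar, rmat, rmat_threshold):
    """Decide whether two DNA-binding domains are similar.

    rmsd in angstroms, rnar in percent, rmat and rmat_threshold as fractions.
    """
    # Rule 1: clearly dissimilar.
    if rmsd > 5.0 or rnar < 70.0 or z_score < 3.5:
        return False
    # Rule 2: clearly similar on structure alone.
    if rmsd <= 3.0 and rnar >= 80.0:
        return True
    # Rule 3: intermediate zone, decided by the DNA-contact pattern match.
    intermediate = (3.0 < rmsd <= 5.0 and rnar >= 70.0) or (
        70.0 <= rnar < 80.0 and rmsd <= 5.0
    )
    return intermediate and rmat >= rmat_threshold
```

After rules 1 and 2, every remaining case falls in the intermediate zone, so the contact-pattern match acts only as a tie-breaker for borderline structural similarity.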

Comparison of the classification for all 338 DNA-binding domain representatives with SCOP at various threshold parameters

Final classification of DNA-binding protein domains (fragment):

Figure: similarity regions on Rmsd (ticks at 5 and 3) and Rnar (ticks at 70 and 80) axes, with one region labeled "Similar if Rmat".

For two polypeptides A and B with all calculated parameter values (Rmsd, Z-score, Rnar, Rseq) and given threshold values (Rmsd threshold, Z-score threshold, Rnar threshold, Rseq threshold) we define:

$SSC_{AB} = (\mathrm{Rmsd} < \mathrm{Rmsd}_{threshold}) \wedge (\text{Z-score} > \text{Z-score}_{threshold}) \wedge (\mathrm{Rnar} > \mathrm{Rnar}_{threshold}) \wedge (\mathrm{Rseq} > \mathrm{Rseq}_{threshold})$

where $\wedge$ denotes logical AND. $SSC_{AB}$ can only take two values: true or false. If $SSC_{AB}$ is true, then A and B are similar. If $SSC_{AB}$ is false, then A and B are not similar.

The chains were clustered such that for every two chains in each cluster the above condition holds true.
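The condition and the clustering constraint can be sketched as follows. The default thresholds are the ones quoted later in the talk (Rmsd 5.0, Z-score 3.8, Rnar 70%, Rseq 90%). The greedy procedure is only one simple way to guarantee that every pair inside a cluster satisfies the condition; the talk does not say which clustering algorithm was actually used, and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    rmsd: float = 5.0
    z_score: float = 3.8
    rnar: float = 70.0
    rseq: float = 90.0


def ssc(rmsd, z_score, rnar, rseq, t=Thresholds()):
    """Structure/sequence similarity condition for a pair of chains."""
    return (
        rmsd < t.rmsd
        and z_score > t.z_score
        and rnar > t.rnar
        and rseq > t.rseq
    )


def cluster(chains, similar):
    """Greedy clustering (an assumption, not necessarily the authors' method).

    A chain joins the first cluster in which it is similar to every member,
    so the pairwise condition holds for all pairs within each cluster.
    similar: callable taking two chains and returning True or False
    """
    clusters = []
    for chain in chains:
        for members in clusters:
            if all(similar(chain, m) for m in members):
                members.append(chain)
                break
        else:
            clusters.append([chain])
    return clusters
```

The result of a greedy pass depends on the input order; it is shown here only because it makes the "every two chains in each cluster" requirement easy to verify.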

Specificity Criteria:

For the clusters where GO terms were available for at least two chains we define:
- positive cluster - where all chains have the same GO terms;
- negative cluster - where chains have different GO terms (more specific definitions for three criteria will be given further);
- TP (true positives) - the number of chains with GO terms in the positive clusters;
- FP (false positives) - the number of chains with GO terms in the negative clusters;
- ppv (positive predictive value) or specificity is the following ratio: TP/(TP+FP)
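As a worked illustration, the ppv can be computed from clustered chains as below. The sketch assumes that FP counts annotated chains in negative clusters (the natural counterpart of TP), and that each cluster is a mapping from chain id to its set of GO terms, with an empty set for unannotated chains. The names and data layout are hypothetical.

```python
def all_same(term_sets):
    """Simplest positive-cluster test: every chain has the same GO terms."""
    return all(s == term_sets[0] for s in term_sets)


def ppv(clusters, is_positive=all_same):
    """Positive predictive value TP / (TP + FP) over clusters.

    clusters: list of dicts mapping chain id -> set of GO terms
    is_positive: predicate on the list of term sets of annotated chains
    Returns None when no cluster has at least two annotated chains.
    """
    tp = fp = 0
    for chains in clusters:
        annotated = [terms for terms in chains.values() if terms]
        if len(annotated) < 2:
            continue  # criteria are defined only for >= 2 annotated chains
        if is_positive(annotated):
            tp += len(annotated)
        else:
            fp += len(annotated)
    return tp / (tp + fp) if tp + fp else None
```

Passing a different `is_positive` predicate lets the same counting be reused for stricter or looser definitions of a positive cluster.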

Specificity Criteria (cont.):

$\{t_{i1},\dots,t_{ik(i)}\}$ is the set of $k(i)$ GO terms for the i-th chain. Each specificity is defined for clusters with at least two annotated chains.

Specificity-1 (the most rigorous) - a positive cluster must have every pair of chains (i, j) with the same set of GO terms: $t_{in} = t_{jn}$, $n = 1,\dots,k(i)$, $k(i) = k(j)$, for all $(i, j)$, $i \in \{1,\dots,N\}$, $j \in \{1,\dots,N\}$.

Specificity-2 (less rigorous than specificity-1) - a positive cluster must have, for every pair of chains (i, j) with a different number of GO terms, the following: all terms of the chain with the smaller number of terms must be present amongst the terms of the chain with the larger number of GO terms: $\{t_{i1},\dots,t_{ik(i)}\} \subseteq \{t_{j1},\dots,t_{jk(j)}\}$ if $k(i) \le k(j)$; $i \in \{1,\dots,N\}$, $j \in \{1,\dots,N\}$; $\{t_1,\dots,t_L\} \neq \emptyset$.

Specificity-3 (less rigorous than specificity-2) - a positive cluster must have a common set of terms $\{t_1,\dots,t_L\}$ for all N chains within the cluster: $\{t_1,\dots,t_L\} \subseteq \{t_{i1},\dots,t_{ik(i)}\}$, $i = 1,\dots,N$; $\{t_1,\dots,t_L\} \neq \emptyset$.

Further detailing of specificity (Specificity-4) should involve the semantic distance (e.g. Lord et al., 2003) between terms in judging a cluster to be positive.
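The three criteria can be read as predicates over the GO term sets of the annotated chains in a cluster. The sketch below is one interpretation of the definitions: it assumes specificity-2 means pairwise set inclusion and specificity-3 means a non-empty intersection over all chains. The function names are hypothetical.

```python
def specificity_1(term_sets):
    """Most rigorous: every chain has exactly the same set of GO terms."""
    return all(s == term_sets[0] for s in term_sets)


def specificity_2(term_sets):
    """For every pair, the smaller term set is contained in the larger one."""
    return all(a <= b or b <= a for a in term_sets for b in term_sets)


def specificity_3(term_sets):
    """All chains share at least one common GO term."""
    common = set(term_sets[0]).intersection(*term_sets[1:])
    return bool(common)
```

With sets of 7 and 4 terms where the 4 are a subset of the 7, the cluster is negative under specificity-1 but positive under specificity-2 and specificity-3; two chains with disjoint term sets are negative under all three.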

Clusterization of PDB chains and the accuracy of GO annotation at different threshold values of structural similarity parameters.

Assignment of GO annotation with structural similarity thresholds (Rmsd 5.0, Z-score 3.8, Rnar 70%, Rseq 90%). Red dots denote newly annotated chains; red arrows denote new GO term-chain associations assigned for newly annotated chains. Purple lines denote new GO term-chain associations assigned for chains previously annotated (by EBI). Black arrows denote existing GO term-chain associations assigned by EBI.

Example of a cluster that is negative by the definition of specificity-1 and positive by the definitions of specificity-2 and specificity-3. Seven GO terms could be assigned to chains 1h9dA and 1h9dC (thresholds: Rmsd 5.0, Z-score 3.8, Rnar 70%, Rseq 90%).

All chains in the cluster are the same protein, Runt-related transcription factor 1 (synonyms: core-binding factor alpha subunit, acute myeloid leukemia 1 protein, etc.).

Chain (number of GO terms): GO terms
1e50A (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50C (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50E (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50G (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50Q (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50R (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1cmoA (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1co1A (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1ljmA (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1ljmB (7): 3677, 3700, 5524, 5634, 6355, 7275, 8151
1hjbC (4): 3677, 5524, 5634, 6355
1hjbF (4): 3677, 5524, 5634, 6355
1hjcA (4): 3677, 5524, 5634, 6355
1hjcD (4): 3677, 5524, 5634, 6355
1io4C (4): 3677, 5524, 5634, 6355
1eanA (4): 3677, 5524, 5634, 6355
1eaoA (4): 3677, 5524, 5634, 6355
1eaoB (4): 3677, 5524, 5634, 6355
1eaqA (4): 3677, 5524, 5634, 6355
1eaqB (4): 3677, 5524, 5634, 6355
1h9dA: no GO terms
1h9dC: no GO terms

GO terms:
3677 (F) - DNA binding
3700 (F) - transcription factor activity
5524 (F) - ATP binding
5634 (C) - nucleus
6355 (P) - regulation of transcription, DNA-dependent
7275 (P) - development
8151 (P) - cell growth and/or maintenance

Example of a positive cluster by the definition of specificity-1: Phospholipase A2.

Chain (number of GO terms): GO terms
1cl5A (5): 4623, 5509, 15070, 16042, 16787
1cl5B (5): 4623, 5509, 15070, 16042, 16787
1fb2A (5): 4623, 5509, 15070, 16042, 16787
1fb2B (5): 4623, 5509, 15070, 16042, 16787
1fv0A (5): 4623, 5509, 15070, 16042, 16787
1fv0B (5): 4623, 5509, 15070, 16042, 16787
1jq8A (5): 4623, 5509, 15070, 16042, 16787
1jq8B (5): 4623, 5509, 15070, 16042, 16787
1jq9A (5): 4623, 5509, 15070, 16042, 16787
1jq9B (5): 4623, 5509, 15070, 16042, 16787
1kpmB: no GO terms

GO terms:
4623 (F) - phospholipase A2 activity
5509 (F) - calcium ion binding
15070 (F) - toxin activity
16042 (P) - lipid catabolism
16787 (F) - hydrolase activity

Only four negative clusters occurred by the definition of specificity-3. An example of missed GO terms for 2mtaC and other chains of Cytochrome c-L (cytochrome c551i):

Chain (number of GO terms): GO terms
2mtaC (3): 5489, 6118, 15945
1mg2D (2): 16021, 16032
1mg2H (2): 16021, 16032
1mg2L (2): 16021, 16032
1mg2P (2): 16021, 16032
1mg3D (2): 16021, 16032
1mg3H (2): 16021, 16032
1mg3L (2): 16021, 16032
1mg3P (2): 16021, 16032

GO terms:
5489 (F) - electron transporter activity
6118 (P) - electron transport
15945 (P) - methanol metabolism
16021 (C) - integral to membrane
16032 (P) - viral life cycle

Evolution of complex systems:
Computers: complexity doubles every 18 months per $$$ (Moore's Law)
Human Brain: very slow (complexity doubles in ~100,000 years)

System                   Short Term Storage   Long Term Storage   Speed       Cost
PC cluster (256 units)   65 GB                5 TB                256 GFLOP   $130K
Human Brain (Average)    57 TB                1137 TB             4.4 TFLOP   $130K

Complexity = Speed x Memory
Computer = 5 TB x 256 GFLOP = 10^24 memory FLOPs
Brain = 1137 TB x 4.4 TFLOP = 5x10^27 memory FLOPs
Brain/Computer = 5x10^3 or 3.7 log units
Moore's Law: 3.5 years/log unit
Based on (Ramsey, 1997)
Human brain capacity for computers will be reached: 2000 + 3.7 x 3.5 = 2013
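The estimate can be retraced with a few lines of arithmetic. The sketch uses only the figures quoted above (storage, speed, 3.5 years per log unit, base year 2000); the helper name is hypothetical. Without rounding the computer figure down to 10^24, the ratio comes out slightly below 5x10^3, but the projected year rounds to the same value.

```python
from math import log10

TB = 1e12      # bytes
GFLOP = 1e9    # floating-point operations per second
TFLOP = 1e12

# Complexity = long-term storage x speed, using the quoted figures.
computer = 5 * TB * 256 * GFLOP
brain = 1137 * TB * 4.4 * TFLOP


def year_reached(ratio, start_year=2000, years_per_log_unit=3.5):
    """Year in which a complexity gap of `ratio` is closed at the quoted rate."""
    return start_year + log10(ratio) * years_per_log_unit


exact_ratio = brain / computer         # unrounded ratio
rounded_ratio = 5e3                    # ratio as rounded in the talk
print(round(year_reached(exact_ratio)), round(year_reached(rounded_ratio)))
```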

The accuracy of predicting the future for the next 2 years equals 10%

Credits:
- Julia Ponomarenko (she did #2 and #3)
- Phil Bourne (discussions, conceptualizations, logistics)
- Lei Xie (PDB statistics)
NIH Grant GM63208
NSF Grants DBI 9808706, DBI 0111710
Gift from Ceres Inc.
